How can columns be excluded in PySpark?


You can use the following methods to exclude specific columns in a PySpark DataFrame:

Method 1: Exclude One Column

#select all columns except 'points' column
df_new = df.drop('points')
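
Note that drop also accepts a column object, not just a name string. For example, assuming the DataFrame df created in the examples below, the following sketch behaves the same way:

#drop can also take a column object instead of a column name string
df_new = df.drop(df.points)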

Method 2: Exclude Multiple Columns

#select all columns except 'conference' and 'points' columns
df_new = df.drop('conference', 'points')
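
If the column names to exclude are stored in a list, you can also unpack the list into drop. This is a minimal sketch, assuming the DataFrame df created in the examples below:

#exclude every column named in the list by unpacking it into drop
cols_to_exclude = ['conference', 'points']
df_new = df.drop(*cols_to_exclude)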

The following examples show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Exclude One Column in PySpark

We can use the following syntax to select all columns in the DataFrame, excluding the points column:

#select all columns except 'points' column
df_new = df.drop('points')

#view new DataFrame
df_new.show()

+----+----------+-------+
|team|conference|assists|
+----+----------+-------+
|   A|      East|      4|
|   A|      East|      9|
|   A|      East|      3|
|   B|      West|     12|
|   B|      West|      4|
|   C|      East|      2|
+----+----------+-------+

Notice that all columns in the DataFrame are selected except for the points column.
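
Also note that drop returns a new DataFrame rather than modifying the original one, so df itself still contains the points column. A quick check, assuming the df and df_new defined above:

#the original DataFrame is unchanged; only df_new excludes 'points'
print(df.columns)      #['team', 'conference', 'points', 'assists']
print(df_new.columns)  #['team', 'conference', 'assists']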

Example 2: Exclude Multiple Columns in PySpark

We can use the following syntax to select all columns in the DataFrame, excluding the conference and points columns:

#select all columns except 'conference' and 'points' columns
df_new = df.drop('conference', 'points')

#view new DataFrame
df_new.show()

+----+-------+
|team|assists|
+----+-------+
|   A|      4|
|   A|      9|
|   A|      3|
|   B|     12|
|   B|      4|
|   C|      2|
+----+-------+

Notice that all columns in the DataFrame are selected except for the conference and points columns.
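
As an alternative to drop, you could build the same result with select and a list comprehension that filters out the unwanted column names. This is a sketch assuming the DataFrame df defined above:

#select every column whose name is not in the exclusion list
exclude = ['conference', 'points']
df_new = df.select([c for c in df.columns if c not in exclude])

#view new DataFrame
df_new.show()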

Note: You can find the complete documentation for the PySpark drop function in the official PySpark API documentation.
