How can I drop multiple columns from a PySpark DataFrame?

You can drop multiple columns from a PySpark DataFrame with the drop method, which accepts one or more column names as separate arguments. To drop the columns named in a Python list, unpack the list with the * operator when calling drop. The method does not modify the original DataFrame; it returns a new DataFrame that contains only the remaining columns.

PySpark: Drop Multiple Columns from DataFrame


There are two common ways to drop multiple columns in a PySpark DataFrame:

Method 1: Drop Multiple Columns by Name

#drop 'team' and 'points' columns
df.drop('team', 'points').show()

Method 2: Drop Multiple Columns Based on List

#define list of columns to drop
drop_cols = ['team', 'points']

#drop all columns in list 
df.drop(*drop_cols).show()

The examples below show how to use each method with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Drop Multiple Columns by Name

We can use the following syntax to drop the team and points columns from the DataFrame:

#drop 'team' and 'points' columns
df.drop('team', 'points').show()

+----------+-------+
|conference|assists|
+----------+-------+
|      East|      4|
|      East|      9|
|      East|      3|
|      West|     12|
|      West|      4|
|      East|      2|
+----------+-------+

Notice that the team and points columns have both been dropped from the DataFrame, just as we specified.
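One thing to watch for: drop does nothing for a column name that is not in the schema, so a misspelled name is silently ignored. A minimal sketch of a pre-check follows. The helper name find_missing_columns is hypothetical (it is not part of PySpark), and it compares names as exact strings, which may be stricter than Spark's own name resolution.

```python
#hypothetical helper (not part of PySpark): list requested names that are absent
def find_missing_columns(existing_cols, drop_cols):
    existing = set(existing_cols)
    return [c for c in drop_cols if c not in existing]

#column names of the example DataFrame (the same values as df.columns)
columns = ['team', 'conference', 'points', 'assists']

#'pts' is a deliberate typo for 'points'
missing = find_missing_columns(columns, ['team', 'pts'])
print(missing)

#with the DataFrame from this tutorial, the same check would be:
#missing = find_missing_columns(df.columns, ['team', 'pts'])
```

If the returned list is not empty, you can correct the names before calling drop.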

Example 2: Drop Multiple Columns Based on List

We can use the following syntax to specify a list of column names and then drop all columns in the DataFrame that belong to the list:

#define list of columns to drop
drop_cols = ['team', 'points']

#drop all columns in list
df.drop(*drop_cols).show()

+----------+-------+
|conference|assists|
+----------+-------+
|      East|      4|
|      East|      9|
|      East|      3|
|      West|     12|
|      West|      4|
|      East|      2|
+----------+-------+

Notice that the resulting DataFrame no longer contains any of the columns that we specified in the list.
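The same result can also be reached from the opposite direction, by selecting the columns to keep. The sketch below builds that keep list from the drop list. The helper name columns_to_keep is hypothetical, and the column names are copied from the example DataFrame so that the snippet runs without a Spark session.

```python
#hypothetical helper (not part of PySpark): keep every column not in the drop list
def columns_to_keep(existing_cols, drop_cols):
    drop = set(drop_cols)
    return [c for c in existing_cols if c not in drop]

#column names of the example DataFrame (the same values as df.columns)
columns = ['team', 'conference', 'points', 'assists']

#define list of columns to drop
drop_cols = ['team', 'points']

#build list of columns to keep, preserving the original order
keep_cols = columns_to_keep(columns, drop_cols)
print(keep_cols)

#with the DataFrame from this tutorial, this selects the remaining columns:
#df.select(*keep_cols).show()
```

For this DataFrame, keep_cols contains conference and assists, the same two columns that remain after df.drop(*drop_cols).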
