How can I select all columns in PySpark except for specific ones?

To select all columns in a PySpark DataFrame except specific ones, pass the names of the columns to exclude to the drop function. It returns a new DataFrame that contains the remaining columns and leaves the original DataFrame unchanged. This avoids having to list every column you want to keep in a select statement.

PySpark: Select All Columns Except Specific Ones


The easiest way to select all columns except specific ones in a PySpark DataFrame is by using the drop function.

Here are two common ways to do so:

Method 1: Select All Columns Except One

#select all columns except 'conference' column
df.drop('conference').show()

Method 2: Select All Columns Except Several Specific Ones

#select all columns except 'conference' and 'assists' columns
df.drop('conference', 'assists').show()

The examples below show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Select All Columns Except One

We can use the following syntax to select all columns in the DataFrame except for the conference column:

#select all columns except 'conference' column
df.drop('conference').show()

+----+------+-------+
|team|points|assists|
+----+------+-------+
|   A|    11|      4|
|   A|     8|      9|
|   A|    10|      3|
|   B|     6|     12|
|   B|     6|      4|
|   C|     5|      2|
+----+------+-------+

Notice that the resulting DataFrame contains all columns from the original DataFrame except for the conference column.
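One caveat: when drop is given a column name as a string and the DataFrame has no such column, it does nothing and raises no error, so a misspelled name goes unnoticed. The sketch below shows one way to check names before dropping. The helper missing_columns is hypothetical (it is not part of PySpark) and works on a plain list of names such as the one df.columns returns, so it runs without a Spark session. The comparison is case-sensitive, whereas Spark by default resolves column names case-insensitively, so treat it as a rough check.

```python
def missing_columns(all_columns, to_drop):
    """Return the names in to_drop that do not appear in all_columns."""
    existing = set(all_columns)
    return [name for name in to_drop if name not in existing]

#column names of the example DataFrame (what df.columns would return)
columns = ['team', 'conference', 'points', 'assists']

#'confrence' is a deliberate typo
to_drop = ['confrence', 'assists']

#collect the names that drop would silently ignore
unknown = missing_columns(columns, to_drop)
print(unknown)

#with a real DataFrame, the check could guard the call to drop:
#if not missing_columns(df.columns, to_drop):
#    df.drop(*to_drop).show()
```

Here unknown contains only the misspelled name, which signals that the column list should be fixed before calling drop.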

Example 2: Select All Columns Except Several Specific Ones

We can use the following syntax to select all columns in the DataFrame except for the conference and assists columns:

#select all columns except 'conference' and 'assists' columns
df.drop('conference', 'assists').show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+

Notice that the resulting DataFrame contains all columns from the original DataFrame except for the conference and assists columns.
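The same result can also be reached with select by building the list of columns to keep. The sketch below illustrates that idea. The helper columns_except is hypothetical (it is not part of PySpark) and operates on a plain list of column names such as df.columns, so it can be run without a Spark session. The select call that would use it is shown as a comment.

```python
def columns_except(all_columns, excluded):
    """Return the names in all_columns that are not in excluded, in their original order."""
    excluded = set(excluded)
    return [name for name in all_columns if name not in excluded]

#column names of the example DataFrame (what df.columns would return)
columns = ['team', 'conference', 'points', 'assists']

#keep every column except 'conference' and 'assists'
keep = columns_except(columns, ['conference', 'assists'])
print(keep)

#with the DataFrame from this tutorial, the equivalent selection would be:
#df.select(*keep).show()
```

This form can be convenient when the list of excluded columns is held in a variable or when the kept columns are needed later for other operations. For simply removing columns, drop remains the shorter option.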
