How can I check if a column exists in a PySpark DataFrame?

To check if a column exists in a PySpark DataFrame, you can apply Python’s “in” operator to the DataFrame’s “columns” attribute, which is a plain Python list of column names. The expression returns True if the name is in the list and False otherwise. The comparison is case-sensitive; for a case-insensitive check, convert both the name you are searching for and the column names to the same case before comparing. This check is useful for confirming that required columns are present before performing any data manipulation or analysis.

PySpark: Check if Column Exists in DataFrame


You can use the following methods in PySpark to check if a particular column exists in a DataFrame:

Method 1: Check if Column Exists (Case-Sensitive)

'points' in df.columns

Method 2: Check if Column Exists (Not Case-Sensitive)

'points'.upper() in (name.upper() for name in df.columns)

The examples below show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', None, 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', None, 12], 
        ['B', 'West', None, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      null|     8|      9|
|   A|      East|    10|      3|
|   B|      West|  null|     12|
|   B|      West|  null|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Check if Column Exists (Case-Sensitive)

We can use the following syntax to check if the column name points exists in the DataFrame:

#check if column name 'points' exists in the DataFrame
'points' in df.columns

True

The output returns True since the column name points does indeed exist in the DataFrame.

Note that this syntax is case-sensitive. If we search instead for the column name Points, the output is False because the case we searched for doesn’t match the case of the column name in the DataFrame:

#check if column name 'Points' exists in the DataFrame
'Points' in df.columns

False
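When several columns are required at once, the same case-sensitive check can be applied to each name in turn. The following is a minimal sketch, not part of the original tutorial: the helper name missing_columns and the extra column name rebounds are hypothetical, and the list of names stands in for df.columns so the snippet runs without a Spark session.

```python
def missing_columns(columns, required):
    """Return the required names that are absent from columns (case-sensitive)."""
    present = set(columns)
    return [name for name in required if name not in present]

#column names from the example DataFrame; with a live DataFrame, pass df.columns
columns = ['team', 'conference', 'points', 'assists']

#'rebounds' is a hypothetical column that the example DataFrame does not have
missing = missing_columns(columns, ['team', 'points', 'rebounds'])
print(missing)  # ['rebounds']
```

An empty result means every required column is present, so the helper can act as a guard before any transformation that depends on those columns.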

Example 2: Check if Column Exists (Not Case-Sensitive)

We can use the following syntax to check if the column name Points exists in the DataFrame, regardless of case:

#check if column name 'Points' exists in the DataFrame
'Points'.upper() in (name.upper() for name in df.columns) 

True

The output returns True even though the case of the column name that we searched for didn’t precisely match the column name points in the DataFrame.

Converting both the search term and every column name to uppercase before comparing is what makes the search case-insensitive.
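A case-insensitive match only tells us that a column exists, not how its name is actually spelled. The sketch below is one possible variation, not the tutorial's own method: the helper name find_column is hypothetical, it uses str.casefold() instead of upper() for the comparison, and the list of names again stands in for df.columns.

```python
def find_column(columns, name):
    """Return the actual column name that matches name ignoring case, or None."""
    target = name.casefold()
    for col in columns:
        if col.casefold() == target:
            return col
    return None

#column names from the example DataFrame; with a live DataFrame, pass df.columns
columns = ['team', 'conference', 'points', 'assists']

print(find_column(columns, 'Points'))    # points
print(find_column(columns, 'rebounds'))  # None
```

Returning the stored name rather than a Boolean lets later code refer to the column with the spelling the DataFrame actually uses.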
