PySpark: Check if Column Exists in DataFrame


You can use the following methods in PySpark to check if a particular column exists in a DataFrame:

Method 1: Check if Column Exists (Case-Sensitive)

'points' in df.columns

Method 2: Check if Column Exists (Not Case-Sensitive)

'points'.upper() in (name.upper() for name in df.columns)

The examples below show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', None, 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', None, 12], 
        ['B', 'West', None, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      null|     8|      9|
|   A|      East|    10|      3|
|   B|      West|  null|     12|
|   B|      West|  null|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Check if Column Exists (Case-Sensitive)

Since df.columns returns a plain Python list of column names, we can use the in operator to check if the column name points exists in the DataFrame:

#check if column name 'points' exists in the DataFrame
'points' in df.columns

True

The output returns True since the column name points does indeed exist in the DataFrame.

Note that this syntax is case-sensitive. If we search for the column name Points instead, the output is False because the case of the search string doesn’t exactly match the case of the column name in the DataFrame:

#check if column name 'Points' exists in the DataFrame
'Points' in df.columns

False
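
When several columns are required at once, the same membership test can be applied to each name. The following is a minimal sketch rather than part of the original example: the helper name missing_columns is hypothetical, and it operates on a plain list of names so that it runs without a Spark session. With a live DataFrame, df.columns would be passed as the first argument.

```python
# Column names from the example DataFrame; df.columns returns this same list of strings
columns = ['team', 'conference', 'points', 'assists']


def missing_columns(available, required):
    """Return the names in `required` that are absent from `available` (case-sensitive)."""
    available_set = set(available)  # set lookup avoids rescanning the list for every name
    return [name for name in required if name not in available_set]


# hypothetical usage: with a real DataFrame, call missing_columns(df.columns, [...])
print(missing_columns(columns, ['team', 'points']))                # []
print(missing_columns(columns, ['points', 'Points', 'rebounds']))  # ['Points', 'rebounds']
```

An empty list means every required column is present. Because the comparison is case-sensitive, Points is reported as missing even though points exists.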

Example 2: Check if Column Exists (Not Case-Sensitive)

We can use the following syntax to check if a column named Points exists in the DataFrame, regardless of case:

#check if column name 'Points' exists in the DataFrame (ignoring case)
'Points'.upper() in (name.upper() for name in df.columns)

True

The output returns True even though the case of the name we searched for doesn’t match the case of the column name points in the DataFrame.

This works because both the search string and every column name are converted to uppercase before they are compared, which makes the search case-insensitive.
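
A case-insensitive check is often followed by the need to know how the column is actually spelled in the DataFrame. The following is a minimal sketch under stated assumptions: the helper name find_column is hypothetical, and it uses Python's str.casefold() in place of upper(), which gives the same result for plain ASCII column names like the ones in this example. It takes a plain list of names so that it runs without a Spark session; with a live DataFrame, df.columns would be passed as the first argument.

```python
# Column names from the example DataFrame; df.columns returns this same list of strings
columns = ['team', 'conference', 'points', 'assists']


def find_column(available, name):
    """Return the first column name that matches `name` ignoring case, or None if no match."""
    target = name.casefold()
    for candidate in available:
        if candidate.casefold() == target:
            return candidate
    return None


# hypothetical usage: with a real DataFrame, call find_column(df.columns, 'Points')
print(find_column(columns, 'Points'))    # points
print(find_column(columns, 'rebounds'))  # None
```

Returning the stored name instead of True makes it possible to use the exact spelling in later operations, while a result of None still signals that the column doesn't exist.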
