How can I create a new column in PySpark only if it doesn’t already exist?

Creating a new column in PySpark can be done using the “withColumn” function. However, if a column with the same name already exists, “withColumn” does not raise an error: it replaces the existing column, which can silently overwrite data. To avoid this, check whether the name appears in df.columns and call “withColumn” only if it is absent. This approach ensures that an existing column is never overwritten by accident.

PySpark: Create Column If It Doesn’t Exist


You can use the following syntax to create a column in a PySpark DataFrame only if it doesn’t already exist:

import pyspark.sql.functions as F

#add 'points' column to DataFrame if it doesn't already exist
if 'points' not in df.columns:
    df = df.withColumn('points', F.lit(100))

This particular example attempts to create a column named points and assign a value of 100 to each row in the column, only if a column named points doesn’t already exist.

The following example shows how to use this syntax in practice.

Example: How to Create Column If It Doesn’t Exist in PySpark

Suppose we have the following PySpark DataFrame with two columns named team and points:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+

Suppose we use the following syntax to attempt to add a new column named points:

import pyspark.sql.functions as F

#add 'points' column to DataFrame if it doesn't already exist
if 'points' not in df.columns:
    df = df.withColumn('points', F.lit(100))

#view updated DataFrame
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+

Since a column named points already exists in the DataFrame, a new column was not added.

The points column that already exists remained unchanged.

However, suppose we attempt to add a new column named assists if it doesn’t already exist:

import pyspark.sql.functions as F

#add 'assists' column to DataFrame if it doesn't already exist
if 'assists' not in df.columns:
    df = df.withColumn('assists', F.lit(100))

#view updated DataFrame
df.show()

+-------+------+-------+
|   team|points|assists|
+-------+------+-------+
|   Mavs|    18|    100|
|   Nets|    33|    100|
| Lakers|    12|    100|
|  Kings|    15|    100|
|  Hawks|    19|    100|
|Wizards|    24|    100|
|  Magic|    28|    100|
|   Jazz|    40|    100|
|Thunder|    24|    100|
|  Spurs|    13|    100|
+-------+------+-------+

Since a column named assists did not already exist in the DataFrame, this new column was added to the DataFrame.

Note that we used the lit function to assign a literal value of 100 to each row in this new assists column.
