You can add a new column to a PySpark DataFrame with the withColumn function. However, if a column with the same name already exists, withColumn does not raise an error: it silently replaces the existing column with the new values. To avoid overwriting data, check whether the column name appears in df.columns and call withColumn only if it does not. This approach leaves existing columns untouched and adds the new column only when it is actually missing.
PySpark: Create Column If It Doesn’t Exist
You can use the following syntax to create a column in a PySpark DataFrame only if it doesn’t already exist:
import pyspark.sql.functions as F
#add 'points' column to DataFrame if it doesn't already exist
if 'points' not in df.columns:
    df = df.withColumn('points', F.lit('100'))
This particular example attempts to create a column named points and assign a value of 100 to each row in the column, only if a column named points doesn’t already exist.
The following example shows how to use this syntax in practice.
Example: How to Create Column If It Doesn’t Exist in PySpark
Suppose we have the following PySpark DataFrame with two columns named team and points:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18],
        ['Nets', 33],
        ['Lakers', 12],
        ['Kings', 15],
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]

#define column names
columns = ['team', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+
Suppose we use the following syntax to attempt to add a new column named points:
import pyspark.sql.functions as F
#add 'points' column to DataFrame if it doesn't already exist
if 'points' not in df.columns:
    df = df.withColumn('points', F.lit('100'))
#view updated DataFrame
df.show()
+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+
Since a column named points already exists in the DataFrame, a new column was not added.
The points column that already exists remained unchanged.
However, suppose we attempt to add a new column named assists if it doesn’t already exist:
import pyspark.sql.functions as F
#add 'assists' column to DataFrame if it doesn't already exist
if 'assists' not in df.columns:
    df = df.withColumn('assists', F.lit('100'))
#view updated DataFrame
df.show()
+-------+------+-------+
|   team|points|assists|
+-------+------+-------+
|   Mavs|    18|    100|
|   Nets|    33|    100|
| Lakers|    12|    100|
|  Kings|    15|    100|
|  Hawks|    19|    100|
|Wizards|    24|    100|
|  Magic|    28|    100|
|   Jazz|    40|    100|
|Thunder|    24|    100|
|  Spurs|    13|    100|
+-------+------+-------+
Since a column named assists did not already exist in the DataFrame, this new column was added to the DataFrame.
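Note that we used the lit function to assign the same literal value to each row in this new assists column. Because the value was passed as the quoted string '100', the column has a string type; pass F.lit(100) to get an integer column instead.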