Add Multiple Columns to PySpark DataFrame


You can use the following methods to add multiple new columns to a PySpark DataFrame:

Method 1: Add Multiple Empty Columns

from pyspark.sql.functions import lit

#add three empty columns
for col in ['new_col1', 'new_col2', 'new_col3']:
    df = df.withColumn(col, lit(None))

Method 2: Add Multiple Columns Based on Existing Columns

#add three new columns based on values in 'points' columns
df = df.withColumn('points2', df.points*2)
       .withColumn('points3', df.points*3)
       .withColumn('points_half', df.points/2)

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

Example 1: Add Multiple Empty Columns

We can use the following syntax to add three new empty columns to the existing DataFrame:

from pyspark.sql.functions import lit

#add three empty columns
for col in ['new_col1', 'new_col2', 'new_col3']:
    df = df.withColumn(col, lit(None))

#view updated DataFrame
df.show()

+----+--------+------+--------+--------+--------+
|team|position|points|new_col1|new_col2|new_col3|
+----+--------+------+--------+--------+--------+
|   A|   Guard|    11|    null|    null|    null|
|   A|   Guard|     8|    null|    null|    null|
|   A| Forward|    22|    null|    null|    null|
|   A| Forward|    22|    null|    null|    null|
|   B|   Guard|    14|    null|    null|    null|
|   B|   Guard|    14|    null|    null|    null|
|   B| Forward|    13|    null|    null|    null|
|   B| Forward|     7|    null|    null|    null|
+----+--------+------+--------+--------+--------+

Notice that three new columns with the names new_col1, new_col2 and new_col3 have all been added to the existing DataFrame.

Since we use the lit function to specify a literal value of None, the values in each new column are simply all null values.

Example 2: Add Multiple Columns Based on Existing Columns

We can use the following syntax to add three new columns whose values are all based on the values in the existing points column of the DataFrame:

#add three new columns based on values in 'points' columns
df = df.withColumn('points2', df.points*2)
       .withColumn('points3', df.points*3)
       .withColumn('points_half', df.points/2)

#view updated DataFrame
df.show()

+----+--------+------+-------+-------+-----------+
|team|position|points|points2|points3|points_half|
+----+--------+------+-------+-------+-----------+
|   A|   Guard|    11|     22|     33|        5.5|
|   A|   Guard|     8|     16|     24|        4.0|
|   A| Forward|    22|     44|     66|       11.0|
|   A| Forward|    22|     44|     66|       11.0|
|   B|   Guard|    14|     28|     42|        7.0|
|   B|   Guard|    14|     28|     42|        7.0|
|   B| Forward|    13|     26|     39|        6.5|
|   B| Forward|     7|     14|     21|        3.5|
+----+--------+------+-------+-------+-----------+

Notice that three new columns have been added to the DataFrame whose values are based on the existing values in the points colulmn.

x