PySpark: Add New Column with Constant Value


You can use the following methods to add a new column with a constant value to a PySpark DataFrame:

Method 1: Add New Column with Constant Numeric Value

from pyspark.sql.functions import lit

#add new column called 'salary' with value of 100 for each row
df.withColumn('salary', lit(100)).show()

Method 2: Add New Column with Constant String Value

from pyspark.sql.functions import lit

#add new column called 'league' with value of 'NBA' for each row
df.withColumn('league', lit('NBA')).show()

The following examples show how to use each method in practice with the PySpark DataFrame below:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Add New Column with Constant Numeric Value

We can use the following syntax to add a new column to the DataFrame called salary that contains a value of 100 for each row:

from pyspark.sql.functions import lit

#add new column called 'salary' with value of 100 for each row
df.withColumn('salary', lit(100)).show()


+----+----------+------+-------+------+
|team|conference|points|assists|salary|
+----+----------+------+-------+------+
|   A|      East|    11|      4|   100|
|   A|      East|     8|      9|   100|
|   A|      East|    10|      3|   100|
|   B|      West|     6|     12|   100|
|   B|      West|     6|      4|   100|
|   C|      East|     5|      2|   100|
+----+----------+------+-------+------+

Notice that the new column called salary has been added to the end of the DataFrame and each value in this new column is equal to 100, just as we specified.
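The literal 100 is typically stored with an integer type; if you would like the constant column to have a specific data type instead, one option is to cast the literal before adding it (a minimal sketch, where the 'double' type is just an example):

from pyspark.sql.functions import lit

#add new column called 'salary' with the constant cast to a double
df.withColumn('salary', lit(100).cast('double')).show()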

Example 2: Add New Column with Constant String Value

We can use the following syntax to add a new column to the DataFrame called league that contains a value of ‘NBA’ for each row:

from pyspark.sql.functions import lit

#add new column called 'league' with value of 'NBA' for each row
df.withColumn('league', lit('NBA')).show()

+----+----------+------+-------+------+
|team|conference|points|assists|league|
+----+----------+------+-------+------+
|   A|      East|    11|      4|   NBA|
|   A|      East|     8|      9|   NBA|
|   A|      East|    10|      3|   NBA|
|   B|      West|     6|     12|   NBA|
|   B|      West|     6|      4|   NBA|
|   C|      East|     5|      2|   NBA|
+----+----------+------+-------+------+

Notice that the new column called league has been added to the end of the DataFrame and each value in this new column is equal to NBA, just as we specified.
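If you would like to add both constant columns at the same time, you can simply chain the withColumn calls (a minimal sketch using the same DataFrame):

from pyspark.sql.functions import lit

#add both constant columns in one chained call
df.withColumn('salary', lit(100)).withColumn('league', lit('NBA')).show()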

Note #1: The withColumn function returns a new DataFrame with the specified column added (or replaced, if a column with that name already exists) and all other columns left the same; the original DataFrame is not modified.
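Because of this, the DataFrame df still contains only the four original columns after the calls above; to keep the new column, you can assign the result back to a variable (df_salary is just an example name):

from pyspark.sql.functions import lit

#df itself is unchanged by the withColumn calls above
print(df.columns)

#assign the result to a new variable to keep the salary column
df_salary = df.withColumn('salary', lit(100))
print(df_salary.columns)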

Note #2: The lit function creates a Column object that holds a literal (constant) value, which is why the same value appears in every row.
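The lit function also accepts other Python literals such as booleans and None; a minimal sketch (the column names active and notes are just examples):

from pyspark.sql.functions import lit

#add a constant boolean column and a typed null column
df.withColumn('active', lit(True)).withColumn('notes', lit(None).cast('string')).show()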
