How can I add a new column with a constant value in PySpark?

To add a new column with a constant value in PySpark, use the `withColumn()` method together with the `lit()` function, which wraps a constant as a literal column expression. This creates a new column containing the same value for every row in the DataFrame, which is useful for attaching a fixed value, such as a timestamp or a category label, to an entire dataset.

PySpark: Add New Column with Constant Value


You can use the following methods to add a new column with a constant value to a PySpark DataFrame:

Method 1: Add New Column with Constant Numeric Value

from pyspark.sql.functions import lit

#add new column called 'salary' with value of 100 for each row
df.withColumn('salary', lit(100)).show()

Method 2: Add New Column with Constant String Value

from pyspark.sql.functions import lit

#add new column called 'league' with value of 'NBA' for each row
df.withColumn('league', lit('NBA')).show()

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Add New Column with Constant Numeric Value

We can use the following syntax to add a new column to the DataFrame called salary that contains a value of 100 for each row:

from pyspark.sql.functions import lit

#add new column called 'salary' with value of 100 for each row
df.withColumn('salary', lit(100)).show()


+----+----------+------+-------+------+
|team|conference|points|assists|salary|
+----+----------+------+-------+------+
|   A|      East|    11|      4|   100|
|   A|      East|     8|      9|   100|
|   A|      East|    10|      3|   100|
|   B|      West|     6|     12|   100|
|   B|      West|     6|      4|   100|
|   C|      East|     5|      2|   100|
+----+----------+------+-------+------+

Notice that the new column called salary has been added to the end of the DataFrame and each value in this new column is equal to 100, just as we specified.

Example 2: Add New Column with Constant String Value

We can use the following syntax to add a new column to the DataFrame called league that contains a value of 'NBA' for each row:

from pyspark.sql.functions import lit

#add new column called 'league' with value of 'NBA' for each row
df.withColumn('league', lit('NBA')).show()

+----+----------+------+-------+------+
|team|conference|points|assists|league|
+----+----------+------+-------+------+
|   A|      East|    11|      4|   NBA|
|   A|      East|     8|      9|   NBA|
|   A|      East|    10|      3|   NBA|
|   B|      West|     6|     12|   NBA|
|   B|      West|     6|      4|   NBA|
|   C|      East|     5|      2|   NBA|
+----+----------+------+-------+------+

Notice that the new column called league has been added to the end of the DataFrame and each value in this new column is equal to NBA, just as we specified.

Note #1: The withColumn function returns a new DataFrame with the specified column added (or replaced, if a column with that name already exists) and all other columns unchanged.

Note #2: The lit function creates a column with a literal value.
