To add a new column with a constant value to a PySpark DataFrame, you can use the `withColumn()` method together with the `lit()` function, which wraps the constant as a literal column. This creates a new column holding the same value for every row in the DataFrame, which is useful for adding a fixed value to an entire dataset, such as a timestamp or a category label.
PySpark: Add New Column with Constant Value
You can use the following methods to add a new column with a constant value to a PySpark DataFrame:
Method 1: Add New Column with Constant Numeric Value
```python
from pyspark.sql.functions import lit

#add new column called 'salary' with value of 100 for each row
df.withColumn('salary', lit(100)).show()
```
Method 2: Add New Column with Constant String Value
```python
from pyspark.sql.functions import lit

#add new column called 'league' with value of 'NBA' for each row
df.withColumn('league', lit('NBA')).show()
```
The following examples show how to use each method in practice with the following PySpark DataFrame:
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4],
        ['A', 'East', 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', 6, 12],
        ['B', 'West', 6, 4],
        ['C', 'East', 5, 2]]

#define column names
columns = ['team', 'conference', 'points', 'assists']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()
```

```
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+
```
Example 1: Add New Column with Constant Numeric Value
We can use the following syntax to add a new column to the DataFrame called salary that contains a value of 100 for each row:
```python
from pyspark.sql.functions import lit

#add new column called 'salary' with value of 100 for each row
df.withColumn('salary', lit(100)).show()
```

```
+----+----------+------+-------+------+
|team|conference|points|assists|salary|
+----+----------+------+-------+------+
|   A|      East|    11|      4|   100|
|   A|      East|     8|      9|   100|
|   A|      East|    10|      3|   100|
|   B|      West|     6|     12|   100|
|   B|      West|     6|      4|   100|
|   C|      East|     5|      2|   100|
+----+----------+------+-------+------+
```
Notice that the new column called salary has been added to the end of the DataFrame and each value in this new column is equal to 100, just as we specified.
Example 2: Add New Column with Constant String Value
We can use the following syntax to add a new column to the DataFrame called league that contains a value of ‘NBA’ for each row:
```python
from pyspark.sql.functions import lit

#add new column called 'league' with value of 'NBA' for each row
df.withColumn('league', lit('NBA')).show()
```

```
+----+----------+------+-------+------+
|team|conference|points|assists|league|
+----+----------+------+-------+------+
|   A|      East|    11|      4|   NBA|
|   A|      East|     8|      9|   NBA|
|   A|      East|    10|      3|   NBA|
|   B|      West|     6|     12|   NBA|
|   B|      West|     6|      4|   NBA|
|   C|      East|     5|      2|   NBA|
+----+----------+------+-------+------+
```
Notice that the new column called league has been added to the end of the DataFrame and each value in this new column is equal to NBA, just as we specified.
Note #1: The withColumn function returns a new DataFrame with the specified column added (or replaced, if a column with that name already exists) and all other columns left unchanged.
Note #2: The lit function creates a column with a literal value.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark: