How can I replace a value in a column in PySpark?

In PySpark, you can replace the value in a column only for the rows that meet a certain condition by using the “when” function from pyspark.sql.functions. You pass “when” a condition and a replacement value, and the replacement is applied only to the rows where the condition is true. Chaining “otherwise” specifies the value for every other row, which is typically the existing value in the column.

PySpark: Conditionally Replace Value in Column


You can use the following syntax to conditionally replace the value in one column of a PySpark DataFrame based on the value in another column:

from pyspark.sql.functions import when

df_new = df.withColumn('points', when(df['conference']=='West', 0).otherwise(df['points']))

This particular example replaces the existing value in the points column with a value of 0 for each row where the corresponding value in the conference column is equal to “West.”

The following example shows how to use this syntax in practice.

Example: How to Conditionally Replace Value in Column of PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+

We can use the following syntax to replace the existing value in the points column with a value of 0 for each row where the corresponding value in the conference column is equal to “West”:

from pyspark.sql.functions import when

#replace value in points column with 0 if value in conference column is 'West'
df_new = df.withColumn('points', when(df['conference']=='West', 0).otherwise(df['points']))

#view new DataFrame
df_new.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     0|
|   B|      West|     0|
|   C|      East|     5|
+----+----------+------+

Notice that the existing values in the points column have been replaced in the two rows where the value in the conference column is equal to “West.”

All other values in the points column have been left unchanged.

Note: You can find the complete documentation for the PySpark when function in the official PySpark API reference.
