How can values in a PySpark column be conditionally replaced?


You can use the following syntax to conditionally replace the value in one column of a PySpark DataFrame based on the value in another column:

from pyspark.sql.functions import when

df_new = df.withColumn('points', when(df['conference']=='West', 0).otherwise(df['points']))

This particular example replaces the existing value in the points column with a value of 0 for each row where the corresponding value in the conference column is equal to “West.”

The following examples show how to use this syntax in practice.

Example: How to Conditionally Replace Value in Column of PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+

We can use the following syntax to replace the existing value in the points column with a value of 0 for each row where the corresponding value in the conference column is equal to “West.”

from pyspark.sql.functions import when

#replace value in points column with 0 if value in conference column is 'West'
df_new = df.withColumn('points', when(df['conference']=='West', 0).otherwise(df['points']))

#view new DataFrame
df_new.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     0|
|   B|      West|     0|
|   C|      East|     5|
+----+----------+------+

Notice that the existing values in the points column have been replaced in the two rows where the value in the conference column is equal to “West.”

All other values in the points column have been left unchanged.

Note: You can find the complete documentation for the PySpark when function .

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

x