In PySpark, you can update column values based on a condition by using the withColumn() method. This method takes two arguments: a column name and a Column expression that defines the value to be assigned to the column. The expression is not a lambda function; it is usually built with F.when(), which takes a condition and the value to use where that condition is true, optionally followed by .otherwise() to supply the value for every other row. For example, you could use the following code to update a column called ‘mark’ to ‘pass’ if its value is greater than 50: df.withColumn('mark', F.when(F.col('mark') > 50, 'pass').otherwise(F.col('mark').cast('string'))) This code replaces all values in the ‘mark’ column that are greater than 50 with the string ‘pass’ and keeps the remaining values as strings. If .otherwise() is omitted, every row that does not meet the condition is set to null.

You can use the following syntax to update column values based on a condition in a PySpark DataFrame:

import pyspark.sql.functions as F

#update all values in 'team' column equal to 'A' to now be 'Atlanta'
df = df.withColumn('team', F.when(df.team=='A', 'Atlanta').otherwise(df.team))

This particular example updates all values in the team column equal to ‘A’ to now be ‘Atlanta’ instead.

Any values in the team column not equal to ‘A’ are simply left untouched.

The following example shows how to use this syntax in practice.

Example: Update Column Values Based on Condition in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

We can use the following syntax to update all of the values in the team column equal to ‘A’ to now be ‘Atlanta’ instead:

import pyspark.sql.functions as F

#update all values in 'team' column equal to 'A' to now be 'Atlanta'
df = df.withColumn('team', F.when(df.team=='A', 'Atlanta').otherwise(df.team))

#view updated DataFrame
df.show()

+-------+--------+------+
|   team|position|points|
+-------+--------+------+
|Atlanta|   Guard|    11|
|Atlanta|   Guard|     8|
|Atlanta| Forward|    22|
|Atlanta| Forward|    22|
|      B|   Guard|    14|
|      B|   Guard|    14|
|      B| Forward|    13|
|      B| Forward|     7|
+-------+--------+------+

From the output we can see that each occurrence of ‘A’ in the team column has been updated to be ‘Atlanta’ instead.

All values in the team column not equal to ‘A’ were simply left the same.

Note: You can find the complete documentation for the PySpark when function in the official PySpark API reference.
