PySpark DataFrames provide a replace() method that can replace multiple values in a column at once. It accepts a dictionary that maps each existing value to its replacement, along with a subset argument that names the column or columns to modify. For example, to replace all occurrences of 1, 2, and 3 with 4, 5, and 6 respectively, pass the mapping {1:4, 2:5, 3:6} to replace(). The rest of this article shows an alternative approach that chains when() conditions.
You can use the following syntax to replace multiple values in one column of a PySpark DataFrame:
from pyspark.sql.functions import when

#replace multiple values in 'team' column
df_new = df.withColumn('team', when(df.team=='A', 'Atlanta')
                               .when(df.team=='B', 'Boston')
                               .when(df.team=='C', 'Chicago')
                               .otherwise(df.team))
This particular example makes the following replacements in the team column of the DataFrame:
- Replace ‘A’ with ‘Atlanta’
- Replace ‘B’ with ‘Boston’
- Replace ‘C’ with ‘Chicago’
The following example shows how to use this syntax in practice.
Example: How to Replace Multiple Values in Column of PySpark DataFrame
Suppose we have the following PySpark DataFrame that contains information about various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'East', 11],
['A', 'East', 8],
['A', 'East', 31],
['B', 'West', 16],
['B', 'West', 6],
['C', 'East', 5],
['D', 'West', 12],
['D', 'West', 24]]
#define column names
columns = ['team', 'conference', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    31|
|   B|      West|    16|
|   B|      West|     6|
|   C|      East|     5|
|   D|      West|    12|
|   D|      West|    24|
+----+----------+------+
We can use the following syntax to replace several values in the team column of the DataFrame:
from pyspark.sql.functions import when

#replace multiple values in 'team' column
df_new = df.withColumn('team', when(df.team=='A', 'Atlanta')
                               .when(df.team=='B', 'Boston')
                               .when(df.team=='C', 'Chicago')
                               .otherwise(df.team))

#view new DataFrame
df_new.show()

+-------+----------+------+
|   team|conference|points|
+-------+----------+------+
|Atlanta|      East|    11|
|Atlanta|      East|     8|
|Atlanta|      East|    31|
| Boston|      West|    16|
| Boston|      West|     6|
|Chicago|      East|     5|
|      D|      West|    12|
|      D|      West|    24|
+-------+----------+------+
Notice that several of the values in the team column have been replaced with specific new values.
Note that we did not specify a replacement for ‘D’ in the team column, so the otherwise(df.team) clause kept its original value.
Note: You can find the complete documentation for the PySpark when function in the official PySpark API reference.