How can I replace a string in a column using PySpark?

To replace a string in a column of a PySpark DataFrame, use the regexp_replace function from pyspark.sql.functions together with withColumn. The regexp_replace function substitutes every match of a pattern in the column's string values with a new string, and withColumn returns a new DataFrame that contains the modified column. The same call can be repeated for other columns, which makes it a convenient way to clean and transform string data.

PySpark: Replace String in Column


You can use the following syntax to replace a specific string in a column of a PySpark DataFrame:

from pyspark.sql.functions import regexp_replace

#replace 'Guard' with 'Gd' in position column
df_new = df.withColumn('position', regexp_replace('position', 'Guard', 'Gd'))

This particular example replaces the string “Guard” with the new string “Gd” in the position column of the DataFrame.

The following example shows how to use this syntax in practice.

Example: Replace String in Column of PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Guard', 13],
        ['B', 'Forward', 7],
        ['C', 'Guard', 8],
        ['C', 'Forward', 5]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B|   Guard|    13|
|   B| Forward|     7|
|   C|   Guard|     8|
|   C| Forward|     5|
+----+--------+------+

We can use the following syntax to replace the string “Guard” with the new string “Gd” in the position column of the DataFrame:

from pyspark.sql.functions import regexp_replace

#replace 'Guard' with 'Gd' in position column
df_new = df.withColumn('position', regexp_replace('position', 'Guard', 'Gd'))

#view new DataFrame
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|      Gd|    11|
|   A|      Gd|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|      Gd|    14|
|   B|      Gd|    14|
|   B|      Gd|    13|
|   B| Forward|     7|
|   C|      Gd|     8|
|   C| Forward|     5|
+----+--------+------+

From the output we can see that each occurrence of “Guard” has been replaced with “Gd” in the position column of the DataFrame.

Note #1: The regexp_replace function is case-sensitive.

Note #2: The complete documentation for the PySpark regexp_replace function is available in the official PySpark API reference.
