How can I remove specific characters from strings in PySpark?

To remove specific characters from strings in PySpark, use the built-in “regexp_replace” function. It replaces every match of a regular expression pattern with a specified string, so passing an empty string as the replacement removes the matched characters. The “translate” function can also remove individual characters by mapping them to an empty replacement string. Both functions operate on whole DataFrame columns, which makes them well suited to data cleaning on large datasets.

PySpark: Remove Specific Characters from Strings


You can use the following methods to remove specific characters from strings in a PySpark DataFrame:

Method 1: Remove Specific Characters from String

from pyspark.sql.functions import regexp_replace

#remove 'avs' from each string in team column
df_new = df.withColumn('team', regexp_replace('team', 'avs', ''))

Method 2: Remove Multiple Groups of Specific Characters from String

from pyspark.sql.functions import regexp_replace

#remove 'avs' and 'awks' from each string in team column
df_new = df.withColumn('team', regexp_replace('team', 'avs|awks', ''))

The following examples show how to use each method in practice with a PySpark DataFrame that contains the team and points scored for various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Hawks', 12], 
        ['Mavs', 15], 
        ['Hawks', 19],
        ['Cavs', 24],
        ['Magic', 28]] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+------+
| team|points|
+-----+------+
| Mavs|    18|
| Nets|    33|
|Hawks|    12|
| Mavs|    15|
|Hawks|    19|
| Cavs|    24|
|Magic|    28|
+-----+------+

Example 1: Remove Specific Characters from String

We can use the following syntax to remove “avs” from any string in the team column of the DataFrame:

from pyspark.sql.functions import regexp_replace

#remove 'avs' from each string in team column
df_new = df.withColumn('team', regexp_replace('team', 'avs', ''))

#view new DataFrame
df_new.show()

+-----+------+
| team|points|
+-----+------+
|    M|    18|
| Nets|    33|
|Hawks|    12|
|    M|    15|
|Hawks|    19|
|    C|    24|
|Magic|    28|
+-----+------+

Notice that the string “avs” has been removed from three team names in the team column of the DataFrame.

Example 2: Remove Multiple Groups of Specific Characters from String

We can use the following syntax to remove the strings “avs” and “awks” from any string in the team column of the DataFrame:

from pyspark.sql.functions import regexp_replace

#remove 'avs' and 'awks' from each string in team column
df_new = df.withColumn('team', regexp_replace('team', 'avs|awks', ''))

#view new DataFrame
df_new.show()

+-----+------+
| team|points|
+-----+------+
|    M|    18|
| Nets|    33|
|    H|    12|
|    M|    15|
|    H|    19|
|    C|    24|
|Magic|    28|
+-----+------+

Notice that the strings “avs” and “awks” have both been removed from the team names in the team column of the DataFrame.

Note #1: The regexp_replace function is case-sensitive by default, so the pattern 'avs' does not match 'AVS'.

Note #2: You can find the complete documentation for the PySpark regexp_replace function in the official PySpark API reference.
