How can I extract a substring in PySpark?

To extract a substring in PySpark, the “substr” function can be used. This function takes in three parameters: the column containing the string, the starting index of the substring, and the length of the substring. For example, if we have a column called “name” that contains strings like “John Smith” and we want to extract only the first name, we can use the substr function as follows: “df.select(substr(df[‘name’],1,4).alias(‘first_name’))”. This will create a new column called “first_name” and extract the first four characters from the “name” column. Other examples of using substr in PySpark include extracting a specific word from a longer string or extracting a portion of a date or timestamp.


You can use the following methods to extract certain substrings from a column in a PySpark DataFrame:

Method 1: Extract Substring from Beginning of String

from pyspark.sql import functions as F

#extract first three characters from team column
df_new = df.withColumn('first3', F.substring('team', 1, 3))

Method 2: Extract Substring from Middle of String

from pyspark.sql import functions as F

#extract four characters starting from position two in team column
df_new = df.withColumn('mid4', F.substring('team', 2, 4))

Method 3: Extract Substring from End of String

from pyspark.sql import functions as F

#extract last three characters from team column
df_new = df.withColumn('last3', F.substring('team', -3, 3))

Method 4: Extract Substring Before Specific Character

from pyspark.sql import functions as F

#extract all characters before space in team column
df_new = df.withColumn('beforespace', F.substring_index('team', ' ', 1))

Method 5: Extract Substring After Specific Character

from pyspark.sql import functions as F

#extract all characters after space in team column
df_new = df.withColumn('afterspace', F.substring_index('team', ' ', -1))

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Dallas Mavs', 18], 
        ['Brooklyn Nets', 33], 
        ['Atlanta Hawks', 12], 
        ['Boston Celtics', 15], 
        ['Miami Heat', 19],
        ['Cleveland Cavs', 24],
        ['Orlando Magic', 28]] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+--------------+------+
|          team|points|
+--------------+------+
|   Dallas Mavs|    18|
| Brooklyn Nets|    33|
| Atlanta Hawks|    12|
|Boston Celtics|    15|
|    Miami Heat|    19|
|Cleveland Cavs|    24|
| Orlando Magic|    28|
+--------------+------+

Example 1: Extract Substring from Beginning of String

We can use the following syntax to extract the first 3 characters from each string in the team column:

from pyspark.sql import functions as F

#extract first three characters from team column
df_new = df.withColumn('first3', F.substring('team', 1, 3))

#view updated DataFrame
df_new.show()

+--------------+------+------+
|          team|points|first3|
+--------------+------+------+
|   Dallas Mavs|    18|   Dal|
| Brooklyn Nets|    33|   Bro|
| Atlanta Hawks|    12|   Atl|
|Boston Celtics|    15|   Bos|
|    Miami Heat|    19|   Mia|
|Cleveland Cavs|    24|   Cle|
| Orlando Magic|    28|   Orl|
+--------------+------+------+

Example 2: Extract Substring from Middle of String

from pyspark.sql import functions as F

#extract four characters starting from position two in team column
df_new = df.withColumn('mid4', F.substring('team', 2, 4))

#view updated DataFrame
df_new.show()

+--------------+------+----+
|          team|points|mid4|
+--------------+------+----+
|   Dallas Mavs|    18|alla|
| Brooklyn Nets|    33|rook|
| Atlanta Hawks|    12|tlan|
|Boston Celtics|    15|osto|
|    Miami Heat|    19|iami|
|Cleveland Cavs|    24|leve|
| Orlando Magic|    28|rlan|
+--------------+------+----+

Example 3: Extract Substring from End of String

We can use the following syntax to extract the last 3 characters from each string in the team column:

from pyspark.sql import functions as F

#extract last three characters from team column
df_new = df.withColumn('last3', F.substring('team', -3, 3))

#view updated DataFrame
df_new.show()

+--------------+------+-----+
|          team|points|last3|
+--------------+------+-----+
|   Dallas Mavs|    18|  avs|
| Brooklyn Nets|    33|  ets|
| Atlanta Hawks|    12|  wks|
|Boston Celtics|    15|  ics|
|    Miami Heat|    19|  eat|
|Cleveland Cavs|    24|  avs|
| Orlando Magic|    28|  gic|
+--------------+------+-----+

Example 4: Extract Substring Before Specific Character

We can use the following syntax to extract all of the characters before the space from each string in the team column:

from pyspark.sql import functions as F

#extract all characters before space in team column
df_new = df.withColumn('beforespace', F.substring_index('team', ' ', 1))

#view updated DataFrame
df_new.show()

+--------------+------+-----------+
|          team|points|beforespace|
+--------------+------+-----------+
|   Dallas Mavs|    18|     Dallas|
| Brooklyn Nets|    33|   Brooklyn|
| Atlanta Hawks|    12|    Atlanta|
|Boston Celtics|    15|     Boston|
|    Miami Heat|    19|      Miami|
|Cleveland Cavs|    24|  Cleveland|
| Orlando Magic|    28|    Orlando|
+--------------+------+-----------+

Example 5: Extract Substring After Specific Character

We can use the following syntax to extract all of the characters after the space from each string in the team column:

from pyspark.sql import functions as F

#extract all characters after space in team column
df_new = df.withColumn('afterspace', F.substring_index('team', ' ', -1))

#view updated DataFrame
df_new.show()

+--------------+------+----------+
|          team|points|afterspace|
+--------------+------+----------+
|   Dallas Mavs|    18|      Mavs|
| Brooklyn Nets|    33|      Nets|
| Atlanta Hawks|    12|     Hawks|
|Boston Celtics|    15|   Celtics|
|    Miami Heat|    19|      Heat|
|Cleveland Cavs|    24|      Cavs|
| Orlando Magic|    28|     Magic|
+--------------+------+----------+

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

x