How do I extract a substring in PySpark?

PySpark is a Python-based framework used for big data processing and analytics. It provides efficient tools for data manipulation, including the ability to extract substrings from a string. Extracting substrings involves selecting a specific portion of a string based on a given condition or position. This can be achieved in PySpark using various methods such as substring(), substr(), and regexp_extract().

To extract a substring in PySpark, you must first create a DataFrame or column containing the string data. Then, you can use one of the aforementioned methods to extract the desired substring. For example, using the substring() method, you can specify the start and end indices of the substring to be extracted, or you can use the substr() method to specify the starting index and the length of the substring. Additionally, the regexp_extract() method allows you to extract substrings based on a regular expression pattern.

Some examples of how to extract substrings in PySpark include extracting the first name from a full name column, extracting the year from a date column, or extracting specific words from a sentence column. These methods provide flexibility in extracting substrings and can be useful for data cleaning, transformation, and analysis tasks. Overall, PySpark offers efficient and versatile options for extracting substrings, making it a valuable tool for big data processing.

Extract Substring in PySpark (With Examples)


You can use the following methods to extract certain substrings from a column in a PySpark DataFrame:

Method 1: Extract Substring from Beginning of String

from pyspark.sql import functions as F

#extract first three characters from team column
df_new = df.withColumn('first3', F.substring('team', 1, 3))

Method 2: Extract Substring from Middle of String

from pyspark.sql import functions as F

#extract four characters starting from position two in team column
df_new = df.withColumn('mid4', F.substring('team', 2, 4))

Method 3: Extract Substring from End of String

from pyspark.sql import functions as F

#extract last three characters from team column
df_new = df.withColumn('last3', F.substring('team', -3, 3))

Method 4: Extract Substring Before Specific Character

from pyspark.sql import functions as F

#extract all characters before space in team column
df_new = df.withColumn('beforespace', F.substring_index('team', ' ', 1))

Method 5: Extract Substring After Specific Character

from pyspark.sql import functions as F

#extract all characters after space in team column
df_new = df.withColumn('afterspace', F.substring_index('team', ' ', -1))

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Dallas Mavs', 18], 
        ['Brooklyn Nets', 33], 
        ['Atlanta Hawks', 12], 
        ['Boston Celtics', 15], 
        ['Miami Heat', 19],
        ['Cleveland Cavs', 24],
        ['Orlando Magic', 28]] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+--------------+------+
|          team|points|
+--------------+------+
|   Dallas Mavs|    18|
| Brooklyn Nets|    33|
| Atlanta Hawks|    12|
|Boston Celtics|    15|
|    Miami Heat|    19|
|Cleveland Cavs|    24|
| Orlando Magic|    28|
+--------------+------+

Example 1: Extract Substring from Beginning of String

We can use the following syntax to extract the first 3 characters from each string in the team column:

from pyspark.sql import functions as F

#extract first three characters from team column
df_new = df.withColumn('first3', F.substring('team', 1, 3))

#view updated DataFrame
df_new.show()

+--------------+------+------+
|          team|points|first3|
+--------------+------+------+
|   Dallas Mavs|    18|   Dal|
| Brooklyn Nets|    33|   Bro|
| Atlanta Hawks|    12|   Atl|
|Boston Celtics|    15|   Bos|
|    Miami Heat|    19|   Mia|
|Cleveland Cavs|    24|   Cle|
| Orlando Magic|    28|   Orl|
+--------------+------+------+

Example 2: Extract Substring from Middle of String

from pyspark.sql import functions as F

#extract four characters starting from position two in team column
df_new = df.withColumn('mid4', F.substring('team', 2, 4))

#view updated DataFrame
df_new.show()

+--------------+------+----+
|          team|points|mid4|
+--------------+------+----+
|   Dallas Mavs|    18|alla|
| Brooklyn Nets|    33|rook|
| Atlanta Hawks|    12|tlan|
|Boston Celtics|    15|osto|
|    Miami Heat|    19|iami|
|Cleveland Cavs|    24|leve|
| Orlando Magic|    28|rlan|
+--------------+------+----+

Example 3: Extract Substring from End of String

We can use the following syntax to extract the last 3 characters from each string in the team column:

from pyspark.sql import functions as F

#extract last three characters from team column
df_new = df.withColumn('last3', F.substring('team', -3, 3))

#view updated DataFrame
df_new.show()

+--------------+------+-----+
|          team|points|last3|
+--------------+------+-----+
|   Dallas Mavs|    18|  avs|
| Brooklyn Nets|    33|  ets|
| Atlanta Hawks|    12|  wks|
|Boston Celtics|    15|  ics|
|    Miami Heat|    19|  eat|
|Cleveland Cavs|    24|  avs|
| Orlando Magic|    28|  gic|
+--------------+------+-----+

Example 4: Extract Substring Before Specific Character

We can use the following syntax to extract all of the characters before the space from each string in the team column:

from pyspark.sql import functions as F

#extract all characters before space in team column
df_new = df.withColumn('beforespace', F.substring_index('team', ' ', 1))

#view updated DataFrame
df_new.show()

+--------------+------+-----------+
|          team|points|beforespace|
+--------------+------+-----------+
|   Dallas Mavs|    18|     Dallas|
| Brooklyn Nets|    33|   Brooklyn|
| Atlanta Hawks|    12|    Atlanta|
|Boston Celtics|    15|     Boston|
|    Miami Heat|    19|      Miami|
|Cleveland Cavs|    24|  Cleveland|
| Orlando Magic|    28|    Orlando|
+--------------+------+-----------+

Example 5: Extract Substring After Specific Character

We can use the following syntax to extract all of the characters after the space from each string in the team column:

from pyspark.sql import functions as F

#extract all characters after space in team column
df_new = df.withColumn('afterspace', F.substring_index('team', ' ', -1))

#view updated DataFrame
df_new.show()

+--------------+------+----------+
|          team|points|afterspace|
+--------------+------+----------+
|   Dallas Mavs|    18|      Mavs|
| Brooklyn Nets|    33|      Nets|
| Atlanta Hawks|    12|     Hawks|
|Boston Celtics|    15|   Celtics|
|    Miami Heat|    19|      Heat|
|Cleveland Cavs|    24|      Cavs|
| Orlando Magic|    28|     Magic|
+--------------+------+----------+

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

x