How can I split a string in a PySpark column and extract the last item from the split list?

This article explains how to split a string within a PySpark column and extract the last item from the resulting array. First, import split, col and size from pyspark.sql.functions and create a PySpark DataFrame with a column containing the strings. Next, apply the split function to the column and specify the delimiter, which turns each string into an array of items. Finally, index the array at position size - 1 to extract the last item. Because the index is computed separately for each row, this works even when the strings split into different numbers of items.

PySpark: Split String in Column and Get Last Item


You can use the following syntax to split a string column in a PySpark DataFrame and get the last item resulting from the split:

from pyspark.sql.functions import split, col, size

#create new column that contains only last item from employees column
#(the parentheses allow the method chain to span multiple lines)
df_new = (df.withColumn('last', split('employees', ' '))
            .withColumn('last', col('last')[size('last') - 1]))

This particular example splits the string in the employees column using a space as the delimiter, then extracts the last item from the split and displays it in a new column named last.

The following example shows how to use this syntax in practice.

Example: Split String and Get Last Item in PySpark

Suppose we have the following PySpark DataFrame that contains information about employee names and total sales at various companies:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Andy Bob Chad', 200],
        ['Doug Eric', 139],
        ['Frank Greg Henry', 187],
        ['Ian John Ken Liam', 349]]
  
#define column names
columns = ['employees', 'sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----------------+-----+
|        employees|sales|
+-----------------+-----+
|    Andy Bob Chad|  200|
|        Doug Eric|  139|
| Frank Greg Henry|  187|
|Ian John Ken Liam|  349|
+-----------------+-----+

Suppose we would like to split the strings in the employees column and display the last item resulting from each split in a new column.

We can use the following syntax to do so:

from pyspark.sql.functions import split, col, size

#create new column that contains only last item from employees column
#(the parentheses allow the method chain to span multiple lines)
df_new = (df.withColumn('last', split('employees', ' '))
            .withColumn('last', col('last')[size('last') - 1]))

#view new DataFrame
df_new.show()

+-----------------+-----+-----+
|        employees|sales| last|
+-----------------+-----+-----+
|    Andy Bob Chad|  200| Chad|
|        Doug Eric|  139| Eric|
| Frank Greg Henry|  187|Henry|
|Ian John Ken Liam|  349| Liam|
+-----------------+-----+-----+

Notice that the new column named last contains the last name from each of the lists in the employees column.

Also note that this syntax was able to get the last item from each list even though the lists had different lengths.

Note: The complete documentation for the PySpark split function is available in the official PySpark API reference.
