How can I split a string column into multiple columns in PySpark?

In PySpark, you can split a string column into multiple columns using the split function from pyspark.sql.functions. This function takes the column to split, a delimiter pattern (interpreted as a regular expression), and an optional limit on the number of resulting elements. It returns a column of arrays rather than a new DataFrame; you then extract individual elements with getItem to create the new columns. This method is useful for organizing and manipulating string data in PySpark, allowing for easier analysis and processing.

PySpark: Split String Column into Multiple Columns


You can use the following syntax to split a string column into multiple columns in a PySpark DataFrame:

from pyspark.sql.functions import split

#split team column using dash as delimiter
df_new = df.withColumn('location', split(df.team, '-').getItem(0)) \
           .withColumn('name', split(df.team, '-').getItem(1))

This particular example uses the split function to split the string in the team column of the DataFrame into two new columns called location and name based on where the dash occurs in the string.

The following example shows how to use this syntax in practice.

Example: Split String into Multiple Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Dallas-Mavs', 18], 
        ['Brooklyn-Nets', 33], 
        ['LA-Lakers', 12], 
        ['Houston-Rockets', 15], 
        ['Atlanta-Hawks', 19],
        ['Boston-Celtics', 24],
        ['Orlando-Magic', 28]]

#define column names
columns = ['team', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+---------------+------+
|           team|points|
+---------------+------+
|    Dallas-Mavs|    18|
|  Brooklyn-Nets|    33|
|      LA-Lakers|    12|
|Houston-Rockets|    15|
|  Atlanta-Hawks|    19|
| Boston-Celtics|    24|
|  Orlando-Magic|    28|
+---------------+------+

Suppose we would like to split the strings in the team column into two new columns based on where the dash occurs in the strings.

We can use the following syntax to do so:

from pyspark.sql.functions import split

#split team column using dash as delimiter
df_new = df.withColumn('location', split(df.team, '-').getItem(0)) \
           .withColumn('name', split(df.team, '-').getItem(1))

#view new DataFrame
df_new.show()

+---------------+------+--------+-------+
|           team|points|location|   name|
+---------------+------+--------+-------+
|    Dallas-Mavs|    18|  Dallas|   Mavs|
|  Brooklyn-Nets|    33|Brooklyn|   Nets|
|      LA-Lakers|    12|      LA| Lakers|
|Houston-Rockets|    15| Houston|Rockets|
|  Atlanta-Hawks|    19| Atlanta|  Hawks|
| Boston-Celtics|    24|  Boston|Celtics|
|  Orlando-Magic|    28| Orlando|  Magic|
+---------------+------+--------+-------+

Notice that the strings in the team column have been split into two new columns called location and name based on where the dash occurred in the string.

Note that we used the split function to split each string, which resulted in an array of two elements for each row.

We then used getItem(0) to extract the first element and getItem(1) to extract the second element for each team.
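By default, split keeps every piece of the string. If a value could contain more than one dash (say, a hypothetical 'San-Antonio-Spurs'), Spark 3.0+ accepts an optional third limit argument, e.g. split(df.team, '-', 2), to cap the resulting array at two elements. As a rough sketch of those semantics, Python's re.split behaves the same way, with maxsplit equal to Spark's limit minus one:

```python
import re

#hypothetical team value containing more than one dash
team = 'San-Antonio-Spurs'

#Spark's split(col, '-', 2) would yield at most 2 elements;
#the local stand-in is re.split with maxsplit = limit - 1
parts = re.split('-', team, maxsplit=1)

print(parts)  #['San', 'Antonio-Spurs']
```

With the cap in place, everything after the first dash stays together in the second element, which is usually what you want when only the leading token is the location.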

Note: You can find the complete documentation for the PySpark split function in the official PySpark documentation.
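One caveat worth keeping in mind: the delimiter you pass to split is treated as a regular expression, not a literal string. A dash is harmless, but a delimiter like a period must be escaped. Python's re module makes a convenient stand-in for illustrating the regex behavior Spark applies:

```python
import re

value = 'LA.Lakers'

#an unescaped "." is a regex wildcard that matches every character,
#so splitting on it leaves only empty strings
wrong = re.split('.', value)

#escaping the dot splits on the literal "." character
right = re.split(r'\.', value)

print(right)  #['LA', 'Lakers']
```

The same escaping applies in PySpark: use split(df.team, '\\.') rather than split(df.team, '.') when splitting on a literal period.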
