How can a string column be split into multiple columns in PySpark?

In PySpark, a string column can be split into multiple columns using the split() function from pyspark.sql.functions. This function takes a column and a regular expression pattern (which can be as simple as a literal delimiter) and returns an array column of strings. Individual elements of that array can then be extracted with getItem() and assigned to new columns using withColumn() or select(). This allows for efficient and flexible data manipulation, making it easier to work with delimited string data in PySpark.


You can use the following syntax to split a string column into multiple columns in a PySpark DataFrame:

from pyspark.sql.functions import split

#split team column using dash as delimiter
df_new = df.withColumn('location', split(df.team, '-').getItem(0)) \
           .withColumn('name', split(df.team, '-').getItem(1))

This particular example uses the split function to split the string in the team column of the DataFrame into two new columns called location and name based on where the dash occurs in the string.

The following example shows how to use this syntax in practice.

Example: Split String into Multiple Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Dallas-Mavs', 18], 
        ['Brooklyn-Nets', 33], 
        ['LA-Lakers', 12], 
        ['Houston-Rockets', 15], 
        ['Atlanta-Hawks', 19],
        ['Boston-Celtics', 24],
        ['Orlando-Magic', 28]]

#define column names
columns = ['team', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+---------------+------+
|           team|points|
+---------------+------+
|    Dallas-Mavs|    18|
|  Brooklyn-Nets|    33|
|      LA-Lakers|    12|
|Houston-Rockets|    15|
|  Atlanta-Hawks|    19|
| Boston-Celtics|    24|
|  Orlando-Magic|    28|
+---------------+------+

Suppose we would like to split the strings in the team column into two new columns based on where the dash occurs in the strings.

We can use the following syntax to do so:

from pyspark.sql.functions import split

#split team column using dash as delimiter
df_new = df.withColumn('location', split(df.team, '-').getItem(0)) \
           .withColumn('name', split(df.team, '-').getItem(1))

#view new DataFrame
df_new.show()

+---------------+------+--------+-------+
|           team|points|location|   name|
+---------------+------+--------+-------+
|    Dallas-Mavs|    18|  Dallas|   Mavs|
|  Brooklyn-Nets|    33|Brooklyn|   Nets|
|      LA-Lakers|    12|      LA| Lakers|
|Houston-Rockets|    15| Houston|Rockets|
|  Atlanta-Hawks|    19| Atlanta|  Hawks|
| Boston-Celtics|    24|  Boston|Celtics|
|  Orlando-Magic|    28| Orlando|  Magic|
+---------------+------+--------+-------+

Notice that the strings in the team column have been split into two new columns called location and name based on where the dash occurred in the string.

Note that we used the split function to split each string on the dash, which produced an array of two strings for each row.

We then used getItem(0) to extract the first string and getItem(1) to extract the second string for each team.

Note: You can find the complete documentation for the PySpark split function in the official PySpark API reference.
