How to Concatenate Columns in PySpark (With Examples)

PySpark provides two main functions for concatenating the columns of a DataFrame. The concat() function takes one or more columns as arguments and returns a single column whose values are the inputs joined end to end. The concat_ws() function ("concat with separator") takes a separator as its first argument, followed by the columns, and returns a single column whose values are the inputs joined with the separator between them. Examples of how to use both functions are provided below.


You can use the following methods to concatenate strings from multiple columns in PySpark:

Method 1: Concatenate Columns

from pyspark.sql.functions import concat

df_new = df.withColumn('team', concat(df.location, df.name))

This particular example uses the concat function to concatenate together the strings in the location and name columns into a new column called team.

Method 2: Concatenate Columns with Separator

from pyspark.sql.functions import concat_ws

df_new = df.withColumn('team', concat_ws(' ', df.location, df.name))

This particular example uses the concat_ws function to concatenate together the strings in the location and name columns into a new column called team, using a space as a separator between the strings. 

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Dallas', 'Mavs', 18], 
        ['Brooklyn', 'Nets', 33], 
        ['LA', 'Lakers', 12], 
        ['Boston', 'Celtics', 15], 
        ['Houston', 'Rockets', 19],
        ['Washington', 'Wizards', 24],
        ['Orlando', 'Magic', 28]] 
  
#define column names
columns = ['location', 'name', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----------+-------+------+
|  location|   name|points|
+----------+-------+------+
|    Dallas|   Mavs|    18|
|  Brooklyn|   Nets|    33|
|        LA| Lakers|    12|
|    Boston|Celtics|    15|
|   Houston|Rockets|    19|
|Washington|Wizards|    24|
|   Orlando|  Magic|    28|
+----------+-------+------+

Example 1: Concatenate Columns in PySpark

We can use the following syntax to concatenate together the strings in the location and name columns into a new column called team:

from pyspark.sql.functions import concat

#concatenate strings in location and name columns
df_new = df.withColumn('team', concat(df.location, df.name))

#view new DataFrame
df_new.show()

+----------+-------+------+-----------------+
|  location|   name|points|             team|
+----------+-------+------+-----------------+
|    Dallas|   Mavs|    18|       DallasMavs|
|  Brooklyn|   Nets|    33|     BrooklynNets|
|        LA| Lakers|    12|         LALakers|
|    Boston|Celtics|    15|    BostonCeltics|
|   Houston|Rockets|    19|   HoustonRockets|
|Washington|Wizards|    24|WashingtonWizards|
|   Orlando|  Magic|    28|     OrlandoMagic|
+----------+-------+------+-----------------+

The new team column concatenates together the strings in the location and name columns.

Note: You can find the complete documentation for the PySpark concat function in the official PySpark API reference.

Example 2: Concatenate Columns with Separator in PySpark

We can use the following syntax to concatenate together the strings in the location and name columns into a new column called team, using a space as a separator:

from pyspark.sql.functions import concat_ws

#concatenate strings in location and name columns, using space as separator
df_new = df.withColumn('team', concat_ws(' ', df.location, df.name)) 

#view new DataFrame
df_new.show()

+----------+-------+------+------------------+
|  location|   name|points|              team|
+----------+-------+------+------------------+
|    Dallas|   Mavs|    18|       Dallas Mavs|
|  Brooklyn|   Nets|    33|     Brooklyn Nets|
|        LA| Lakers|    12|         LA Lakers|
|    Boston|Celtics|    15|    Boston Celtics|
|   Houston|Rockets|    19|   Houston Rockets|
|Washington|Wizards|    24|Washington Wizards|
|   Orlando|  Magic|    28|     Orlando Magic|
+----------+-------+------+------------------+

Note: You can find the complete documentation for the PySpark concat_ws function in the official PySpark API reference.
