How can I concatenate columns in PySpark, with examples?

Concatenation is the process of combining multiple columns of data into a single column. In PySpark, this can be achieved using the concat function, which takes one or more columns as arguments and returns a new column with the combined data. For example, if we have two columns named "first_name" and "last_name", we can use the concat function to create a new column "full_name" that contains the combined data from both columns. This can be done as follows:

from pyspark.sql.functions import concat, col, lit

df.withColumn("full_name", concat(col("first_name"), lit(" "), col("last_name")))

This will create a new column named "full_name" containing the first name and last name separated by a space. We can also use the concat function to concatenate more than two columns. For instance, if we have three columns named "city", "state", and "country", we can create a new column "location" by combining all three as follows:

df.withColumn("location", concat(col("city"), lit(", "), col("state"), lit(", "), col("country")))

This will create a new column "location" containing the city, state, and country separated by commas. In summary, the concat function in PySpark makes it easy to combine columns into a new column with the desired data.

Concatenate Columns in PySpark (With Examples)


You can use the following methods to concatenate strings from multiple columns in PySpark:

Method 1: Concatenate Columns

from pyspark.sql.functions import concat

df_new = df.withColumn('team', concat(df.location, df.name))

This particular example uses the concat function to concatenate together the strings in the location and name columns into a new column called team.

Method 2: Concatenate Columns with Separator

from pyspark.sql.functions import concat_ws

df_new = df.withColumn('team', concat_ws(' ', df.location, df.name))

This particular example uses the concat_ws function to concatenate together the strings in the location and name columns into a new column called team, using a space as a separator between the strings. 

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Dallas', 'Mavs', 18], 
        ['Brooklyn', 'Nets', 33], 
        ['LA', 'Lakers', 12], 
        ['Boston', 'Celtics', 15], 
        ['Houston', 'Rockets', 19],
        ['Washington', 'Wizards', 24],
        ['Orlando', 'Magic', 28]] 
  
#define column names
columns = ['location', 'name', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----------+-------+------+
|  location|   name|points|
+----------+-------+------+
|    Dallas|   Mavs|    18|
|  Brooklyn|   Nets|    33|
|        LA| Lakers|    12|
|    Boston|Celtics|    15|
|   Houston|Rockets|    19|
|Washington|Wizards|    24|
|   Orlando|  Magic|    28|
+----------+-------+------+

Example 1: Concatenate Columns in PySpark

We can use the following syntax to concatenate together the strings in the location and name columns into a new column called team:

from pyspark.sql.functions import concat

#concatenate strings in location and name columns
df_new = df.withColumn('team', concat(df.location, df.name))

#view new DataFrame
df_new.show()

+----------+-------+------+-----------------+
|  location|   name|points|             team|
+----------+-------+------+-----------------+
|    Dallas|   Mavs|    18|       DallasMavs|
|  Brooklyn|   Nets|    33|     BrooklynNets|
|        LA| Lakers|    12|         LALakers|
|    Boston|Celtics|    15|    BostonCeltics|
|   Houston|Rockets|    19|   HoustonRockets|
|Washington|Wizards|    24|WashingtonWizards|
|   Orlando|  Magic|    28|     OrlandoMagic|
+----------+-------+------+-----------------+

The new team column concatenates together the strings in the location and name columns.

Note: You can find the complete documentation for the PySpark concat function in the PySpark API reference.

Example 2: Concatenate Columns with Separator in PySpark

We can use the following syntax to concatenate together the strings in the location and name columns into a new column called team, using a space as a separator:

from pyspark.sql.functions import concat_ws

#concatenate strings in location and name columns, using space as separator
df_new = df.withColumn('team', concat_ws(' ', df.location, df.name)) 

#view new DataFrame
df_new.show()

+----------+-------+------+------------------+
|  location|   name|points|              team|
+----------+-------+------+------------------+
|    Dallas|   Mavs|    18|       Dallas Mavs|
|  Brooklyn|   Nets|    33|     Brooklyn Nets|
|        LA| Lakers|    12|         LA Lakers|
|    Boston|Celtics|    15|    Boston Celtics|
|   Houston|Rockets|    19|   Houston Rockets|
|Washington|Wizards|    24|Washington Wizards|
|   Orlando|  Magic|    28|     Orlando Magic|
+----------+-------+------+------------------+

Note: You can find the complete documentation for the PySpark concat_ws function in the PySpark API reference.
