How to Use groupBy and Concatenate Strings in PySpark

Using the groupBy function together with concat_ws and collect_list in PySpark, you can group the rows of a DataFrame and then join the string values within each group into a single string. The groupBy function accepts one or more columns and splits the data into groups based on the values in those columns. The collect_list function gathers the strings in each group into an array, and concat_ws joins the elements of that array into one string using a separator of your choice. This is a useful way to combine the values from many rows of a group into a single string.


You can use the following syntax to group by one column in a PySpark DataFrame and then concatenate multiple strings together that belong to the same group:

import pyspark.sql.functions as F

#group by store and concatenate list of employee names
df_new = df.groupby('store')\
           .agg(F.concat_ws(', ', F.collect_list(df.employee))\
           .alias('employee_names'))

This particular example groups the rows of the DataFrame based on the values in the store column and then concatenates all of the strings in the employee column that belong to the same group.
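
As a side note, concat_ws and collect_list are also available as Spark SQL functions, so the same aggregation can be expressed as a SQL query. The following is a minimal sketch, assuming the DataFrame df and a SparkSession named spark already exist and using employees as an illustrative view name:

#register the DataFrame as a temporary view (the view name is our own choice)
df.createOrReplaceTempView('employees')

#equivalent aggregation expressed in Spark SQL
df_new = spark.sql("""
    SELECT store,
           concat_ws(', ', collect_list(employee)) AS employee_names
    FROM employees
    GROUP BY store
""")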

The following example shows how to use the groupBy and concat_ws syntax in practice.

Example: How to Use groupBy and Concatenate Strings in PySpark

Suppose we have the following PySpark DataFrame that contains information about employees at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 1, 'Andy'], 
        ['A', 1, 'Bob'], 
        ['A', 2, 'Chad'], 
        ['B', 2, 'Diane'], 
        ['B', 1, 'Eric'],
        ['B', 4, 'Frida'],
        ['C', 2, 'Greg'],
        ['C', 3, 'Henry']]
  
#define column names
columns = ['store', 'quarter', 'employee'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+-------+--------+
|store|quarter|employee|
+-----+-------+--------+
|    A|      1|    Andy|
|    A|      1|     Bob|
|    A|      2|    Chad|
|    B|      2|   Diane|
|    B|      1|    Eric|
|    B|      4|   Frida|
|    C|      2|    Greg|
|    C|      3|   Henry|
+-----+-------+--------+

Suppose we would like to group by the values in the store column and then concatenate together each of the strings in the employee column that belong to the same group.

We can use the following syntax to do so:

import pyspark.sql.functions as F

#group by store and concatenate list of employee names
df_new = df.groupby('store')\
           .agg(F.concat_ws(', ', F.collect_list(df.employee))\
           .alias('employee_names'))

#view new DataFrame
df_new.show()

+-----+------------------+
|store|    employee_names|
+-----+------------------+
|    A|   Andy, Bob, Chad|
|    B|Diane, Eric, Frida|
|    C|       Greg, Henry|
+-----+------------------+

The new employee_names column uses a comma to concatenate together the names of the employees who work at the same store.

For example:

  • Andy, Bob and Chad all work at store A.
  • Diane, Eric and Frida all work at store B.
  • Greg and Henry both work at store C.

Note that we used the concat_ws function to concatenate together the employee names using a comma as a separator.

However, we could specify a different separator to use when concatenating the strings if we’d like.

For example, we can use the following syntax to concatenate the names of the employees using the & symbol as the separator:

import pyspark.sql.functions as F

#group by store and concatenate list of employee names
df_new = df.groupby('store')\
           .agg(F.concat_ws(' & ', F.collect_list(df.employee))\
           .alias('employee_names'))

#view new DataFrame
df_new.show()

+-----+--------------------+
|store|      employee_names|
+-----+--------------------+
|    A|   Andy & Bob & Chad|
|    B|Diane & Eric & Frida|
|    C|        Greg & Henry|
+-----+--------------------+

Note: We used the alias function to specify the name to use for the new column that holds the concatenated employee names.
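
One more variation worth knowing: collect_list keeps duplicate values and does not guarantee any particular ordering within each group. If you want the concatenated names to be distinct and listed alphabetically, one option is to collect with collect_set and sort the resulting array with array_sort (available in Spark 2.4 and later) before concatenating. The following is a sketch of that idea, not something the examples above require; the name df_sorted is just an illustrative choice:

import pyspark.sql.functions as F

#group by store, collect distinct names, sort them, then concatenate
df_sorted = df.groupby('store')\
              .agg(F.concat_ws(', ', F.array_sort(F.collect_set(df.employee)))\
              .alias('employee_names'))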
