How can I use groupBy on multiple columns in PySpark?


You can use the following syntax to group by multiple columns and perform an aggregation in a PySpark DataFrame:

df.groupBy('team', 'position').sum('points').show()

This particular example calculates the sum of the values in the points column, grouped by the values in the team and position columns of the DataFrame.

By including multiple column names in the groupBy function, you can group by multiple columns before performing an aggregation.
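As a side note, groupBy also accepts a single list of column names, which can be convenient when the grouping columns are chosen at runtime. Here is a minimal sketch (the one-row DataFrame is just a stand-in for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#stand-in DataFrame with the same columns used in the example below
df = spark.createDataFrame([('A', 'Guard', 11)], ['team', 'position', 'points'])

#groupBy accepts a list of column names, equivalent to passing them separately
group_cols = ['team', 'position']
df.groupBy(group_cols).sum('points').show()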

The following example shows how to use this syntax in practice.

Example: How to Use groupBy on Multiple Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about the points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

We can use the following syntax to calculate the sum of the values in the points column, grouped by the values in both the team and position columns:

#calculate sum of points, grouped by team and position
df.groupBy('team', 'position').sum('points').show()

+----+--------+-----------+
|team|position|sum(points)|
+----+--------+-----------+
|   A|   Guard|         19|
|   A| Forward|         44|
|   B|   Guard|         28|
|   B| Forward|         20|
+----+--------+-----------+

The resulting DataFrame shows the sum of the points values, grouped by team and position.

For example, we can see:

  • The sum of points for all guards on team A is 19.
  • The sum of points for all forwards on team A is 44.
  • The sum of points for all guards on team B is 28.
  • The sum of points for all forwards on team B is 20.
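Note that groupBy does not guarantee any particular row order in the results. If you need the grouped output in a predictable order, you can chain orderBy onto the aggregation. A minimal sketch, assuming the same df as above:

#calculate sum of points, then sort the grouped results by team and position
df.groupBy('team', 'position').sum('points').orderBy('team', 'position').show()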

If you would like to give the sum(points) column a different name, you can use the alias function as follows:

#import PySpark's sum function (note: this shadows Python's built-in sum)
from pyspark.sql.functions import sum

#calculate sum of points, grouped by team and position
df.groupBy('team', 'position').agg(sum('points').alias('points_sum')).show()

+----+--------+----------+
|team|position|points_sum|
+----+--------+----------+
|   A|   Guard|        19|
|   A| Forward|        44|
|   B|   Guard|        28|
|   B| Forward|        20|
+----+--------+----------+

The resulting DataFrame shows the sum of points scored by each team and position with the sum column now named points_sum, just as we specified in the alias function.

Note: In this example we used sum as the aggregation function, but you could use a different function such as count, mean, max, etc.
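For instance, you can compute several aggregations at once by passing multiple aggregation functions to agg. Here is a minimal sketch using the same df as above (the alias names are just illustrative):

from pyspark.sql.functions import count, avg, max

#calculate count, mean, and max of points, grouped by team and position
#(note: importing max here shadows Python's built-in max)
df.groupBy('team', 'position').agg(count('points').alias('points_count'),
                                   avg('points').alias('points_avg'),
                                   max('points').alias('points_max')).show()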
