How can I use Groupby Agg on multiple columns in PySpark?

In PySpark, the groupBy() and agg() methods let you group data by one or more columns and apply aggregate functions to each group. This is useful for analyzing data and calculating summary statistics. To use them on multiple columns, pass the columns you want to group by to groupBy(), then chain the .agg() method with the aggregate functions you want to apply. This lets you perform multiple calculations on different columns in a single pass, making data analysis more efficient and streamlined.
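Here is a minimal sketch of the pattern (assuming a DataFrame named df with columns team and points; the output names total_points and avg_points are just illustrative):

from pyspark.sql.functions import sum, avg

#group by one column and apply two aggregate functions to another
df.groupBy('team').agg(sum('points').alias('total_points'),
                       avg('points').alias('avg_points')).show()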

PySpark: Use Groupby Agg on Multiple Columns


You can use the following syntax to group by and perform aggregations on multiple columns in a PySpark DataFrame:

from pyspark.sql.functions import sum, mean, count

#group by team column and aggregate using multiple columns
df.groupBy('team').agg(sum('points').alias('sum_pts'),
                       mean('points').alias('mean_pts'),
                       count('assists').alias('count_ast')).show()

This particular example groups the rows of the DataFrame by the team column, then performs the following aggregations:

  • Calculates the sum of the points column and uses the name sum_pts
  • Calculates the mean of the points column and uses the name mean_pts
  • Calculates the count of the assists column and uses the name count_ast
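Note that groupBy() also accepts more than one column, so the same pattern extends to grouping by several columns at once. For example, a brief sketch grouping by both team and position:

#group by both team and position, then aggregate the points column
df.groupBy('team', 'position').agg(sum('points').alias('sum_pts')).show()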

The following example shows how to use this syntax in practice.

Example: How to Use Groupby Agg On Multiple Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 5],
        ['A', 'Guard', 8, 4],
        ['A', 'Forward', 22, 3],
        ['A', 'Forward', 22, 6],
        ['B', 'Guard', 14, 3],
        ['B', 'Guard', 14, 5],
        ['B', 'Forward', 13, 7],
        ['B', 'Forward', 14, 8],
        ['C', 'Forward', 23, 2],
        ['C', 'Guard', 30, 5]]
  
#define column names
columns = ['team', 'position', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      5|
|   A|   Guard|     8|      4|
|   A| Forward|    22|      3|
|   A| Forward|    22|      6|
|   B|   Guard|    14|      3|
|   B|   Guard|    14|      5|
|   B| Forward|    13|      7|
|   B| Forward|    14|      8|
|   C| Forward|    23|      2|
|   C|   Guard|    30|      5|
+----+--------+------+-------+

We can use the following syntax to group the rows by the values in the team column and then calculate several aggregate metrics:

from pyspark.sql.functions import sum, mean, count

#group by team column and aggregate using multiple columns
df.groupBy('team').agg(sum('points').alias('sum_pts'),
                       mean('points').alias('mean_pts'),
                       count('assists').alias('count_ast')).show()

+----+-------+--------+---------+
|team|sum_pts|mean_pts|count_ast|
+----+-------+--------+---------+
|   A|     63|   15.75|        4|
|   B|     55|   13.75|        4|
|   C|     53|    26.5|        2|
+----+-------+--------+---------+

The resulting DataFrame shows the sum of the points values, the mean of the points values, and the count of the assists values for each team.

For example, we can see:

  • The sum of points for team A is 63.
  • The mean of points for team A is 15.75.
  • The count of assists for team A is 4.

And so on.
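As an alternative, agg() also accepts a dictionary that maps column names to aggregate function names. You give up control over the output column names (PySpark generates defaults such as sum(points)), but it can be convenient for quick summaries:

#equivalent aggregation using a dictionary of column -> function name pairs
df.groupBy('team').agg({'points': 'sum', 'assists': 'count'}).show()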
