How can PySpark be used to perform Groupby Agg on Multiple Columns?

PySpark is a powerful tool for handling large datasets, and one of its most useful features is the ability to group rows and aggregate multiple columns in a single operation. Data can be grouped by specific criteria and then summarized with functions such as sum, count, and mean, which makes it easy to spot patterns and trends within a dataset. PySpark's efficient distributed processing makes it well suited for this kind of analysis, which is why it is a popular choice among data scientists and analysts.


You can use the following syntax to group by and perform aggregations on multiple columns in a PySpark DataFrame:

from pyspark.sql.functions import *

#group by team column and aggregate using multiple columns
df.groupBy('team').agg(sum('points').alias('sum_pts'), 
                       mean('points').alias('mean_pts'),
                       count('assists').alias('count_ast')).show()

This particular example groups the rows of the DataFrame by the team column, then performs the following aggregations:

  • Calculates the sum of the points column and uses the name sum_pts
  • Calculates the mean of the points column and uses the name mean_pts
  • Calculates the count of the assists column and uses the name count_ast
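Note that the wildcard import above makes PySpark's sum, mean, and count shadow Python's built-in functions of the same names. If you'd rather avoid that, an equivalent sketch of the same aggregation imports the functions module under an alias instead:

from pyspark.sql import functions as F

#same aggregation with a namespaced import, which leaves Python's
#built-in sum() and other names untouched
df.groupBy('team').agg(F.sum('points').alias('sum_pts'),
                       F.mean('points').alias('mean_pts'),
                       F.count('assists').alias('count_ast')).show()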

The following example shows how to use this syntax in practice.

Example: How to Use Groupby Agg On Multiple Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 5],
        ['A', 'Guard', 8, 4],
        ['A', 'Forward', 22, 3],
        ['A', 'Forward', 22, 6],
        ['B', 'Guard', 14, 3],
        ['B', 'Guard', 14, 5],
        ['B', 'Forward', 13, 7],
        ['B', 'Forward', 14, 8],
        ['C', 'Forward', 23, 2],
        ['C', 'Guard', 30, 5]]
  
#define column names
columns = ['team', 'position', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      5|
|   A|   Guard|     8|      4|
|   A| Forward|    22|      3|
|   A| Forward|    22|      6|
|   B|   Guard|    14|      3|
|   B|   Guard|    14|      5|
|   B| Forward|    13|      7|
|   B| Forward|    14|      8|
|   C| Forward|    23|      2|
|   C|   Guard|    30|      5|
+----+--------+------+-------+

We can use the following syntax to group the rows by the values in the team column and then calculate several aggregate metrics:

from pyspark.sql.functions import *

#group by team column and aggregate using multiple columns
df.groupBy('team').agg(sum('points').alias('sum_pts'), 
                       mean('points').alias('mean_pts'),
                       count('assists').alias('count_ast')).show()

+----+-------+--------+---------+
|team|sum_pts|mean_pts|count_ast|
+----+-------+--------+---------+
|   A|     63|   15.75|        4|
|   B|     55|   13.75|        4|
|   C|     53|    26.5|        2|
+----+-------+--------+---------+

The resulting DataFrame shows the sum of the points values, the mean of the points values, and the count of the assists values for each team.

For example, we can see:

  • The sum of points for team A is 63.
  • The mean of points for team A is 15.75.
  • The count of assists for team A is 4.

And so on.
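
You can also group by more than one column by passing several column names to groupBy. For example, the following sketch (using the same DataFrame and the same imports as above) groups by both the team and position columns before aggregating:

#group by team and position columns and aggregate the points column
df.groupBy('team', 'position').agg(sum('points').alias('sum_pts'),
                                   mean('points').alias('mean_pts')).show()

The result contains one row for each unique combination of team and position.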
