The mean (average) can be calculated by group in PySpark using the groupBy() function, which groups the rows of a DataFrame by one or more columns. Once the data is grouped, the mean of a numeric column is computed separately for each group, either by calling mean() directly on the grouped data or by passing an aggregate expression to agg(). Because the work is distributed by Spark, this approach is practical for comparing groups in large datasets.

Calculate the Mean by Group in PySpark

You can use the following methods to calculate the mean value by group in a PySpark DataFrame:

Method 1: Calculate Mean Grouped by One Column

#calculate mean of 'points' grouped by 'team'
df.groupBy('team').mean('points').show()

Method 2: Calculate Mean Grouped by Multiple Columns

#calculate mean of 'points' grouped by 'team' and 'position'
df.groupBy('team', 'position').mean('points').show()

The following examples show how to use each method in practice with the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Guard', 13],
        ['B', 'Forward', 7],
        ['C', 'Guard', 8],
        ['C', 'Forward', 5]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B|   Guard|    13|
|   B| Forward|     7|
|   C|   Guard|     8|
|   C| Forward|     5|
+----+--------+------+

Example 1: Calculate Mean Grouped by One Column

We can use the following syntax to calculate the mean value in the points column grouped by the values in the team column:

#calculate mean of 'points' grouped by 'team'
df.groupBy('team').mean('points').show()

+----+-----------+
|team|avg(points)|
+----+-----------+
|   A|      15.75|
|   B|       12.0|
|   C|        6.5|
+----+-----------+

From the output we can see:

The average points value for players on team A is 15.75.
The average points value for players on team B is 12.
The average points value for players on team C is 6.5.
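
As a cross-check that does not depend on Spark, the per-team means above can be reproduced with plain Python. This is only a small sketch of the arithmetic (sum of points divided by the number of players in each team); the variable names are illustrative.

```python
from collections import defaultdict

# (team, points) pairs copied from the example DataFrame
rows = [('A', 11), ('A', 8), ('A', 22), ('A', 22),
        ('B', 14), ('B', 14), ('B', 13), ('B', 7),
        ('C', 8), ('C', 5)]

# collect the points values for each team
points_by_team = defaultdict(list)
for team, points in rows:
    points_by_team[team].append(points)

# mean = sum of values / number of values, computed per team
means = {team: sum(p) / len(p) for team, p in points_by_team.items()}
print(means)  # {'A': 15.75, 'B': 12.0, 'C': 6.5}
```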

Example 2: Calculate Mean Grouped by Multiple Columns

We can use the following syntax to calculate the mean value in the points column grouped by the values in the team and position columns:

#calculate mean of 'points' grouped by 'team' and 'position'
df.groupBy('team', 'position').mean('points').show()

+----+--------+------------------+
|team|position|       avg(points)|
+----+--------+------------------+
|   A|   Guard|               9.5|
|   A| Forward|              22.0|
|   B|   Guard|13.666666666666666|
|   B| Forward|               7.0|
|   C| Forward|               5.0|
|   C|   Guard|               8.0|
+----+--------+------------------+

From the output we can see:

The average points value for Guards on team A is 9.5.
The average points value for Forwards on team A is 22.
The average points value for Guards on team B is approximately 13.67.
The remaining rows can be read the same way.
