How can I efficiently calculate the maximum value within each group in PySpark?

To calculate the maximum value within each group in PySpark, group the data by the relevant column(s) with the groupBy() function, then apply the max() function to the desired column through agg(). This computes the maximum for every group in a single aggregation pass, avoiding repeated scans of the data, and using the built-in max() function instead of a custom UDF lets Spark optimize the calculation.
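
As a quick illustration, here is a minimal sketch of that pattern, assuming a DataFrame named df with a grouping column 'team' and a numeric column 'points' (the same names used in the examples below):

import pyspark.sql.functions as F

#built-in aggregation: one max value per distinct 'team'
df.groupBy('team').agg(F.max('points')).show()

#shorthand that skips agg() and produces the same result
df.groupBy('team').max('points').show()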

Calculate the Max by Group in PySpark


You can use the following methods to calculate the max value by group in a PySpark DataFrame:

Method 1: Calculate Max Grouped by One Column

import pyspark.sql.functions as F   

#calculate max of 'points' grouped by 'team' 
df.groupBy('team').agg(F.max('points')).show()

Method 2: Calculate Max Grouped by Multiple Columns

import pyspark.sql.functions as F   

#calculate max of 'points' grouped by 'team' and 'position' 
df.groupBy('team', 'position').agg(F.max('points')).show()
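
Note that the aggregated column is named max(points) by default. If you prefer a cleaner name, one option (not required for the examples below) is to attach an alias to the aggregation:

import pyspark.sql.functions as F

#calculate max of 'points' grouped by 'team', renaming the result column
df.groupBy('team').agg(F.max('points').alias('max_points')).show()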

The following examples show how to use each method in practice with a PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Guard', 13],
        ['B', 'Forward', 7],
        ['C', 'Guard', 8],
        ['C', 'Forward', 5]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B|   Guard|    13|
|   B| Forward|     7|
|   C|   Guard|     8|
|   C| Forward|     5|
+----+--------+------+
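
Since spark.createDataFrame infers the column types from the Python values, 'points' should come through as a numeric (long) column, which is what max() needs to compare values numerically. You can confirm the inferred types with printSchema():

#inspect the inferred column types
df.printSchema()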

Example 1: Calculate Max Grouped by One Column

We can use the following syntax to calculate the max value in the points column grouped by the values in the team column:

import pyspark.sql.functions as F   

#calculate max of 'points' grouped by 'team' 
df.groupBy('team').agg(F.max('points')).show()

+----+-----------+
|team|max(points)|
+----+-----------+
|   A|         22|
|   B|         14|
|   C|          8|
+----+-----------+

From the output we can see:

  • The max points value for players on team A is 22.
  • The max points value for players on team B is 14.
  • The max points value for players on team C is 8.
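
If you need the group maxima in plain Python rather than in a displayed DataFrame, one option is to collect the aggregated rows and build a dictionary from them; the sketch below assumes the same df and uses an alias of max_points for readability:

import pyspark.sql.functions as F

#collect the per-team max values back to the driver as a dict
max_by_team = {
    row['team']: row['max_points']
    for row in df.groupBy('team').agg(F.max('points').alias('max_points')).collect()
}

print(max_by_team)  #e.g. {'A': 22, 'B': 14, 'C': 8}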

Example 2: Calculate Max Grouped by Multiple Columns

We can use the following syntax to calculate the max value in the points column grouped by the values in the team and position columns:

import pyspark.sql.functions as F   

#calculate max of 'points' grouped by 'team' and 'position' 
df.groupBy('team', 'position').agg(F.max('points')).show()

+----+--------+-----------+
|team|position|max(points)|
+----+--------+-----------+
|   A|   Guard|         11|
|   A| Forward|         22|
|   B|   Guard|         14|
|   B| Forward|          7|
|   C| Forward|          5|
|   C|   Guard|          8|
+----+--------+-----------+

From the output we can see:

  • The max points value for Guards on team A is 11.
  • The max points value for Forwards on team A is 22.
  • The max points value for Guards on team B is 14.
  • And so on for the remaining team and position groups.
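
Also note that groupBy() collapses the DataFrame down to one row per group. If you instead want to keep every player row and simply attach the group max as a new column, a window function is a common alternative; here is a minimal sketch using Window.partitionBy:

import pyspark.sql.functions as F
from pyspark.sql import Window

#add a column containing the max 'points' within each 'team' and 'position' group
w = Window.partitionBy('team', 'position')
df.withColumn('max_points', F.max('points').over(w)).show()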
