How can the minimum value be calculated by group in PySpark?

To calculate the minimum value by group in PySpark, group the DataFrame by one or more columns with “groupBy” and then use the “agg” function to apply the minimum calculation to each group. The result is a new DataFrame with one row per group that holds the minimum value for that group. This is a common step in data analysis for finding the lowest value within each category of a dataset.

Calculate the Minimum by Group in PySpark


You can use the following methods to calculate the minimum value by group in a PySpark DataFrame:

Method 1: Calculate Minimum Grouped by One Column

import pyspark.sql.functions as F   

#calculate minimum of 'points' grouped by 'team' 
df.groupBy('team').agg(F.min('points')).show()

Method 2: Calculate Minimum Grouped by Multiple Columns

import pyspark.sql.functions as F   

#calculate minimum of 'points' grouped by 'team' and 'position' 
df.groupBy('team', 'position').agg(F.min('points')).show()

The examples below show how to use each method in practice with a PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Guard', 13],
        ['B', 'Forward', 7],
        ['C', 'Guard', 8],
        ['C', 'Forward', 5]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B|   Guard|    13|
|   B| Forward|     7|
|   C|   Guard|     8|
|   C| Forward|     5|
+----+--------+------+

Example 1: Calculate Minimum Grouped by One Column

We can use the following syntax to calculate the minimum value in the points column grouped by the values in the team column:

import pyspark.sql.functions as F   

#calculate minimum of 'points' grouped by 'team' 
df.groupBy('team').agg(F.min('points')).show()

+----+-----------+
|team|min(points)|
+----+-----------+
|   A|          8|
|   B|          7|
|   C|          5|
+----+-----------+

From the output we can see:

  • The minimum points value for players on team A is 8.
  • The minimum points value for players on team B is 7.
  • The minimum points value for players on team C is 5.

Example 2: Calculate Minimum Grouped by Multiple Columns

We can use the following syntax to calculate the minimum value in the points column grouped by the values in the team and position columns:

import pyspark.sql.functions as F   

#calculate minimum of 'points' grouped by 'team' and 'position' 
df.groupBy('team', 'position').agg(F.min('points')).show()

+----+--------+-----------+
|team|position|min(points)|
+----+--------+-----------+
|   A|   Guard|          8|
|   A| Forward|         22|
|   B|   Guard|         13|
|   B| Forward|          7|
|   C| Forward|          5|
|   C|   Guard|          8|
+----+--------+-----------+

From the output we can see:

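  • The minimum points value for Guards on team A is 8.
  • The minimum points value for Forwards on team A is 22.
  • The minimum points value for Guards on team B is 13.
  • The minimum points value for Forwards on team B is 7.
  • The minimum points value for Forwards on team C is 5.
  • The minimum points value for Guards on team C is 8.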
