How can we calculate the median by group in PySpark?

Calculating the median by group in PySpark means finding the middle value of a column within each group of rows in a DataFrame. PySpark is the Python API for Apache Spark, so the calculation is expressed by grouping the data on one or more columns and then applying a median aggregate to each group. The median is a measure of central tendency that is less sensitive to outliers than the mean, which makes it common in data analysis and machine learning work. Because Spark distributes the computation, this approach scales to large datasets.

Calculate the Median by Group in PySpark


You can use the following methods to calculate the median value by group in a PySpark DataFrame. Both rely on the F.median function, which was added in PySpark 3.4:

Method 1: Calculate Median Grouped by One Column

import pyspark.sql.functions as F   

#calculate median of 'points' grouped by 'team' 
df.groupBy('team').agg(F.median('points')).show()

Method 2: Calculate Median Grouped by Multiple Columns

import pyspark.sql.functions as F   

#calculate median of 'points' grouped by 'team' and 'position' 
df.groupBy('team', 'position').agg(F.median('points')).show()

The examples below show how to use each method in practice with a PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Guard', 13],
        ['B', 'Forward', 7],
        ['C', 'Guard', 8],
        ['C', 'Forward', 5]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B|   Guard|    13|
|   B| Forward|     7|
|   C|   Guard|     8|
|   C| Forward|     5|
+----+--------+------+

Example 1: Calculate Median Grouped by One Column

We can use the following syntax to calculate the median value in the points column grouped by the values in the team column:

import pyspark.sql.functions as F   

#calculate median of 'points' grouped by 'team' 
df.groupBy('team').agg(F.median('points')).show()

+----+--------------+
|team|median(points)|
+----+--------------+
|   A|          16.5|
|   B|          13.5|
|   C|           6.5|
+----+--------------+

From the output we can see:

  • The median points value for players on team A is 16.5.
  • The median points value for players on team B is 13.5.
  • The median points value for players on team C is 6.5.
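To see where these numbers come from, the same grouped medians can be reproduced without Spark. The sketch below is a plain-Python cross-check, not part of the PySpark API. It assumes the ten rows from the DataFrame above are available as a list, and the helper name median_by_group is hypothetical. It uses statistics.median from the standard library, which averages the two middle values when a group has an even number of rows.

```python
from collections import defaultdict
from statistics import median

# Same rows as the PySpark DataFrame: (team, position, points)
rows = [
    ('A', 'Guard', 11), ('A', 'Guard', 8),
    ('A', 'Forward', 22), ('A', 'Forward', 22),
    ('B', 'Guard', 14), ('B', 'Guard', 14),
    ('B', 'Guard', 13), ('B', 'Forward', 7),
    ('C', 'Guard', 8), ('C', 'Forward', 5),
]

def median_by_group(rows, key_idx=0, value_idx=2):
    """Hypothetical helper: median of one field, grouped by another field."""
    groups = defaultdict(list)
    for row in rows:
        # collect every value that belongs to the same group key
        groups[row[key_idx]].append(row[value_idx])
    return {key: float(median(values)) for key, values in groups.items()}

team_medians = median_by_group(rows)
print(team_medians)  # {'A': 16.5, 'B': 13.5, 'C': 6.5}
```

Team A has four rows (8, 11, 22, 22 once sorted), so its median is the average of 11 and 22, which is 16.5 and matches the PySpark output.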

Example 2: Calculate Median Grouped by Multiple Columns

We can use the following syntax to calculate the median value in the points column grouped by the values in the team and position columns:

import pyspark.sql.functions as F   

#calculate median of 'points' grouped by 'team' and 'position' 
df.groupBy('team', 'position').agg(F.median('points')).show()

+----+--------+--------------+
|team|position|median(points)|
+----+--------+--------------+
|   A|   Guard|           9.5|
|   A| Forward|          22.0|
|   B|   Guard|          14.0|
|   B| Forward|           7.0|
|   C| Forward|           5.0|
|   C|   Guard|           8.0|
+----+--------+--------------+

From the output we can see:

  • The median points value for Guards on team A is 9.5.
  • The median points value for Forwards on team A is 22.
  • The median points value for Guards on team B is 14.
