How can I calculate the median of a column in PySpark?

Calculating the median of a column in PySpark follows the standard definition: sort the values in ascending order and count them. If the count is odd, the median is the middle value; if it is even, the median is the average of the two middle values. In PySpark you can compute this exactly with the `median` aggregate function (available in Spark 3.4 and later) or approximately with `DataFrame.approxQuantile`, which takes a column name, a list of quantile probabilities (0.5 represents the median), and a relative error tolerance, and returns the requested quantiles as a list. Either approach gives a quick and efficient way to calculate the median of a column in PySpark.
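
For instance, here is a minimal sketch of the approximate approach, assuming a DataFrame `df` with a numeric column named 'game1' as in the examples below:

#approximate median: pass the column name, a list of probabilities
#(0.5 = median), and a relative error (0.0 requests an exact result)
df.approxQuantile('game1', [0.5], 0.01)[0]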

Calculate the Median of a Column in PySpark


You can use the following methods to calculate the median of a column in a PySpark DataFrame:

Method 1: Calculate Median for One Specific Column

from pyspark.sql import functions as F

#calculate median of column named 'game1'
df.agg(F.median('game1')).collect()[0][0]

Method 2: Calculate Median for Multiple Columns

from pyspark.sql.functions import median 

#calculate median for game1, game2 and game3 columns
df.select(median(df.game1), median(df.game2), median(df.game3)).show()

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 25, 11, 10], 
        ['Nets', 22, 8, 14], 
        ['Hawks', 14, 22, 10], 
        ['Kings', 30, 22, 35], 
        ['Bulls', 15, 14, 12], 
        ['Blazers', 10, 14, 18]] 
  
#define column names
columns = ['team', 'game1', 'game2', 'game3'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+-----+-----+-----+
|   team|game1|game2|game3|
+-------+-----+-----+-----+
|   Mavs|   25|   11|   10|
|   Nets|   22|    8|   14|
|  Hawks|   14|   22|   10|
|  Kings|   30|   22|   35|
|  Bulls|   15|   14|   12|
|Blazers|   10|   14|   18|
+-------+-----+-----+-----+

Example 1: Calculate Median for One Specific Column

We can use the following syntax to calculate the median of values in the game1 column of the DataFrame only:

from pyspark.sql import functions as F

#calculate median of column named 'game1'
df.agg(F.median('game1')).collect()[0][0]

18.5

The median of values in the game1 column turns out to be 18.5.

We can verify this is correct by manually calculating the median of values in this column:

All values in game1 column: 10, 14, 15, 22, 25, 30

The two “middle” values are 15 and 22. The average of these two values is 18.5, which represents the median.
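
As a quick sanity check outside of Spark, Python's built-in `statistics` module gives the same answer when applied to these values locally:

import statistics

#median of the game1 values, computed locally
statistics.median([10, 14, 15, 22, 25, 30])

18.5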

Example 2: Calculate Median for Multiple Columns

We can use the following syntax to calculate the median of values for the game1, game2 and game3 columns of the DataFrame:

from pyspark.sql.functions import median

#calculate median for game1, game2 and game3 columns
df.select(median(df.game1), median(df.game2), median(df.game3)).show()

+-------------+-------------+-------------+
|median(game1)|median(game2)|median(game3)|
+-------------+-------------+-------------+
|         18.5|         14.0|         13.0|
+-------------+-------------+-------------+
  • The median of values in the game1 column is 18.5.
  • The median of values in the game2 column is 14.
  • The median of values in the game3 column is 13.
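
If the DataFrame had many game columns, spelling out each call by hand would get tedious. As a sketch under the same setup, a list comprehension over the column names builds the equivalent select:

from pyspark.sql.functions import median

#calculate the median of every column except 'team'
game_cols = [c for c in df.columns if c != 'team']
df.select([median(c) for c in game_cols]).show()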

Note: If there are null values in the column, the median function will ignore these values by default.
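
For example, with a small hypothetical DataFrame (`df_nulls` and its columns are made up for illustration), the null row is skipped and the median is computed over the remaining values:

from pyspark.sql import functions as F

#create a DataFrame containing a null value in the 'points' column
df_nulls = spark.createDataFrame([('A', 10), ('B', None), ('C', 20)],
                                 ['team', 'points'])

#the null is ignored, so the median of 10 and 20 is returned
df_nulls.agg(F.median('points')).collect()[0][0]

15.0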
