How Do You Calculate Quartiles in PySpark?


In statistics, quartiles are values that split up a dataset into four equal parts.

When analyzing a distribution, we’re typically interested in the following quartiles (illustrated with a short pure-Python sketch after this list):

  • First Quartile (Q1): The value located at the 25th percentile
  • Second Quartile (Q2): The value located at the 50th percentile
  • Third Quartile (Q3): The value located at the 75th percentile
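
For intuition, here is a quick pure-Python sketch (using the standard library statistics module, completely outside of Spark) that computes these three cut points for a small list of values. Keep in mind that different tools use different interpolation methods, so the exact values may differ slightly from what Spark returns:

import statistics

#small illustrative list of values
values = [12, 13, 15, 18, 19, 24, 24, 28, 33, 40]

#n=4 returns the three cut points that split the data into quartiles
q1, q2, q3 = statistics.quantiles(values, n=4)

print(q1, q2, q3)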

You can use the following syntax to calculate the quartiles for a column in a PySpark DataFrame:

#calculate quartiles of 'points' column
df.approxQuantile('points', [0.25, 0.5, 0.75], 0)

The second argument is the list of probabilities to calculate (0.25, 0.5 and 0.75 correspond to Q1, Q2 and Q3) and the third argument is the relative error; passing 0 tells Spark to compute exact quantiles, which can be expensive on very large DataFrames.

The following example shows how to use this syntax in practice.

Example: How to Calculate Quartiles in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+

We can use the following syntax to calculate the quartiles for the points column:

#calculate quartiles of 'points' column
df.approxQuantile('points', [0.25, 0.5, 0.75], 0)

[15.0, 19.0, 28.0]

From the output we can see:

  • The first quartile is located at 15.
  • The second quartile is located at 19.
  • The third quartile is located at 28.

By knowing only these three values, we can gain a good understanding of how the values in the points column are distributed.
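
Since approxQuantile returns a plain Python list of floats, you can also unpack the three values to compute related statistics such as the interquartile range (IQR). A minimal sketch using the same DataFrame:

#unpack the quartiles and calculate the interquartile range (IQR)
q1, q2, q3 = df.approxQuantile('points', [0.25, 0.5, 0.75], 0)
iqr = q3 - q1

print(iqr)

13.0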

Note: You can find the complete documentation for the PySpark approxQuantile function in the official PySpark API documentation.
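
If you prefer to keep the result inside a DataFrame (for example, to calculate quartiles per group with groupBy), Spark 3.1 and later also provide the percentile_approx function. Here is a minimal sketch, assuming Spark 3.1+ and the same df as above:

#calculate quartiles of 'points' using percentile_approx (Spark 3.1+)
from pyspark.sql import functions as F

df.agg(
    F.percentile_approx('points', [0.25, 0.5, 0.75], 1000000).alias('quartiles')
).show(truncate=False)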
