How can I calculate quartiles in PySpark?

Calculating quartiles in PySpark means determining the values that divide a dataset into four equal parts, i.e. the 25th, 50th, and 75th percentiles. PySpark provides statistical functions for this, such as the DataFrame method approxQuantile and the SQL function percentile_approx, which take a column (or list of columns) and a list of quantile probabilities and return the corresponding values. Both use the Spark SQL engine to process large datasets efficiently, so you can calculate quartiles and gain insight into the distribution of your data with a single call.
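
For instance, here is a minimal sketch of the percentile_approx approach (assuming Spark 3.1+, where percentile_approx is available in pyspark.sql.functions, and a DataFrame df with a numeric 'points' column like the one built below):

from pyspark.sql import functions as F

#compute all three quartiles of the 'points' column in one pass
df.select(
    F.percentile_approx('points', [0.25, 0.5, 0.75]).alias('quartiles')
).show(truncate=False)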

Calculate Quartiles in PySpark (With Example)


In statistics, quartiles are values that split up a dataset into four equal parts.

When analyzing a distribution, we’re typically interested in the following quartiles:

  • First Quartile (Q1): The value located at the 25th percentile
  • Second Quartile (Q2): The value located at the 50th percentile
  • Third Quartile (Q3): The value located at the 75th percentile
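
For intuition, these three quartiles can be checked on a small sample with plain Python's standard library (a minimal sketch; note that interpolated definitions like this one can differ slightly from PySpark's approxQuantile, which returns actual values present in the column):

import statistics

#points scored by ten players (the sample data used below)
points = [18, 33, 12, 15, 19, 24, 28, 40, 24, 13]

#'inclusive' interpolates between neighboring data points
q1, q2, q3 = statistics.quantiles(points, n=4, method='inclusive')
print(q1, q2, q3)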

You can use the following syntax to calculate the quartiles for a column in a PySpark DataFrame:

#calculate quartiles of 'points' column
df.approxQuantile('points', [0.25, 0.5, 0.75], 0)
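
The third argument is the relative error: 0 returns exact quartiles, while a small positive value trades a little accuracy for speed on very large datasets. For example:

#allow 1% relative error for a faster approximate answer on big data
df.approxQuantile('points', [0.25, 0.5, 0.75], 0.01)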

The following example shows how to use this syntax in practice.

Example: How to Calculate Quartiles in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+

We can use the following syntax to calculate the quartiles for the points column:

#calculate quartiles of 'points' column
df.approxQuantile('points', [0.25, 0.5, 0.75], 0)

[15.0, 19.0, 28.0]

From the output we can see:

  • The first quartile is located at 15.
  • The second quartile is located at 19.
  • The third quartile is located at 28.

Knowing just these three values gives us a good sense of the distribution of values in the points column.
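
As a quick follow-up, one common use of these three values is the interquartile range (IQR), which measures the spread of the middle 50% of the data. Here is a minimal sketch using the quartiles returned above:

#unpack the quartiles returned by approxQuantile
q1, q2, q3 = df.approxQuantile('points', [0.25, 0.5, 0.75], 0)

#interquartile range: spread of the middle 50% of values
iqr = q3 - q1

print(iqr)  #28.0 - 15.0 = 13.0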

Note: You can find the complete documentation for the PySpark approxQuantile function in the official PySpark API reference.
