How to fill null values with median in PySpark?

In PySpark, null values can be filled with the column median by first computing each column's median with the median() function (available in pyspark.sql.functions since Spark 3.4) inside an agg() call, then converting the one-row result to a dictionary and passing it to the fillna() function, which replaces the null values in each column with that column's median. On older Spark versions, the DataFrame's approxQuantile() method can be used to compute the median instead.


You can use the following syntax to fill null values with the column median in a PySpark DataFrame:

from pyspark.sql.functions import median

#define function to fill null values with column median
def fillna_median(df, include=set()): 
    #compute the median of each included column in a single aggregation
    medians = df.agg(*(
        median(x).alias(x) for x in df.columns if x in include
    ))
    #convert the one-row result to a dict and fill each column's nulls
    return df.fillna(medians.first().asDict())

#fill null values with median in specific columns
df = fillna_median(df, ['points', 'assists'])

This particular example fills the null values in the points and assists columns of the DataFrame with their respective column medians.

The following example shows how to use this syntax in practice.

Example: How to Fill Null Values with Median in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', None, 2], 
        ['C', 'East', 5, None]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|  null|      2|
|   C|      East|     5|   null|
+----+----------+------+-------+

Notice that both the points and assists columns have one null value.

We can use the following syntax to fill in the null values in each column with the column median:

from pyspark.sql.functions import median

#define function to fill null values with column median
def fillna_median(df, include=set()): 
    #compute the median of each included column in a single aggregation
    medians = df.agg(*(
        median(x).alias(x) for x in df.columns if x in include
    ))
    #convert the one-row result to a dict and fill each column's nulls
    return df.fillna(medians.first().asDict())

#fill null values with median in specific columns
df = fillna_median(df, ['points', 'assists'])

#view updated DataFrame
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     8|      2|
|   C|      East|     5|      4|
+----+----------+------+-------+

Notice that the null values in both the points and assists columns have been replaced with their respective column medians.

For example, the null value in the points column has been replaced with 8, which represents the median value in the points column.

Similarly, the null value in the assists column has been replaced with 4, which represents the median value in the assists column.
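As a quick sanity check in plain Python, the standard library's statistics.median confirms these fill values from the non-null entries of each column:

```python
from statistics import median

#non-null values from the points and assists columns above
points = [11, 8, 10, 6, 5]
assists = [4, 9, 3, 12, 2]

print(median(points))   #8
print(median(assists))  #4
```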

Note: You can find the complete documentation for the PySpark fillna() function in the official PySpark documentation.
