How can I use PySpark to fill null values in a dataset with the median value for each column?

PySpark is a powerful tool for processing large datasets in a distributed computing environment, and one of its useful capabilities is handling missing (null) values. To fill null values with a meaningful statistic such as the median, you can use PySpark's built-in aggregate functions to compute the median of each column and then pass those values to fillna(). This keeps the workflow efficient and ensures the resulting DataFrame is complete and ready for further analysis.

PySpark: Fill Null Values with Median


You can use the following syntax to fill null values with the column median in a PySpark DataFrame:

from pyspark.sql.functions import median

#define function to fill null values with column median
def fillna_median(df, include=set()):
    medians = df.agg(*(
        median(x).alias(x) for x in df.columns if x in include
    ))
    return df.fillna(medians.first().asDict())

#fill null values with median in specific columns
df = fillna_median(df, ['points', 'assists'])

This particular example fills the null values in the points and assists columns of the DataFrame with their respective column medians.
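Note that the median function is only available in PySpark 3.4 and later. On older versions, a similar helper can be built with percentile_approx (available since PySpark 3.1), which computes an approximate median. The following is a minimal sketch of that alternative; the helper name fillna_median_approx is just illustrative:

from pyspark.sql.functions import percentile_approx

#define function to fill null values with an approximate column median
#(for PySpark versions older than 3.4, where median is not available)
def fillna_median_approx(df, include=set()):
    medians = df.agg(*(
        percentile_approx(x, 0.5).alias(x) for x in df.columns if x in include
    ))
    return df.fillna(medians.first().asDict())

#fill null values with approximate median in specific columns
df = fillna_median_approx(df, ['points', 'assists'])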

The following example shows how to use this syntax in practice.

Example: How to Fill Null Values with Median in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', None, 2], 
        ['C', 'East', 5, None]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|  null|      2|
|   C|      East|     5|   null|
+----+----------+------+-------+

Notice that both the points and assists columns have one null value.
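Before filling, you can quickly confirm which columns contain nulls. This check is not part of the fill itself, just a short sketch using standard PySpark functions:

from pyspark.sql.functions import col, count, when

#count the null values in each column
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

This prints one row with the null count per column: 1 for points, 1 for assists, and 0 for the other columns.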

We can use the following syntax to fill in the null values in each column with the column median:

from pyspark.sql.functions import median

#define function to fill null values with column median
def fillna_median(df, include=set()):
    medians = df.agg(*(
        median(x).alias(x) for x in df.columns if x in include
    ))
    return df.fillna(medians.first().asDict())

#fill null values with median in specific columns
df = fillna_median(df, ['points', 'assists'])

#view updated DataFrame
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     8|      2|
|   C|      East|     5|      4|
+----+----------+------+-------+

Notice that the null values in both the points and assists columns have been replaced with their respective column medians.

For example, the null value in the points column has been replaced with 8, which represents the median value in the points column.

Similarly, the null value in the assists column has been replaced with 4, which represents the median value in the assists column.
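If you want to verify these medians yourself, you can compute them directly with the same median function. This is a sketch to run on the original DataFrame, before the nulls are filled (median ignores null values when aggregating):

from pyspark.sql.functions import median

#compute the median of each numeric column directly
df.agg(median('points').alias('points_median'),
       median('assists').alias('assists_median')).show()

For the original data, this returns 8 for points and 4 for assists.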

Note: You can find the complete documentation for the PySpark fillna() function in the official PySpark API reference.
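For reference, fillna() accepts a dictionary mapping column names to replacement values, which is exactly what the helper function above builds from the computed medians. For this particular dataset, the helper call is equivalent to writing the values out by hand:

#equivalent manual call for this dataset (8 and 4 are the computed medians)
df = df.fillna({'points': 8, 'assists': 4})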
