How can I use PySpark to fill null values in a dataset with the mean value?

PySpark is a powerful tool for data analysis and manipulation in Python. One common problem when working with datasets is dealing with missing values, also known as null values. To ensure accurate analysis and predictions, it is important to handle null values appropriately, and one common approach is to fill them with the mean of their column. In PySpark this can be done with the fillna() function, which accepts a mapping from column names to replacement values. This makes it easy to fill null values with the column mean efficiently, preserving the integrity of the dataset.

PySpark: Fill Null Values with Mean


You can use the following syntax to fill null values with the column mean in a PySpark DataFrame:

from pyspark.sql.functions import mean

#define function to fill null values with column mean
def fillna_mean(df, include=set()):
    means = df.agg(*(
        mean(x).alias(x) for x in df.columns if x in include
    ))
    return df.fillna(means.first().asDict())

#fill null values with mean in specific columns
df = fillna_mean(df, ['points', 'assists'])

This particular example fills the null values in the points and assists columns of the DataFrame with their respective column means.

The following example shows how to use this syntax in practice.

Example: How to Fill Null Values with Mean in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', None, 2], 
        ['C', 'East', 5, None]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|  null|      2|
|   C|      East|     5|   null|
+----+----------+------+-------+

Notice that both the points and assists columns have one null value.

We can use the following syntax to fill in the null values in each column with the column mean:

from pyspark.sql.functions import mean

#define function to fill null values with column mean
def fillna_mean(df, include=set()):
    means = df.agg(*(
        mean(x).alias(x) for x in df.columns if x in include
    ))
    return df.fillna(means.first().asDict())

#fill null values with mean in specific columns
df = fillna_mean(df, ['points', 'assists'])

#view updated DataFrame
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     8|      2|
|   C|      East|     5|      6|
+----+----------+------+-------+

Notice that the null values in both the points and assists columns have been replaced with their respective column means.

For example, the null value in the points column has been replaced with 8, which represents the mean value in the points column.

Similarly, the null value in the assists column has been replaced with 6, which represents the mean value in the assists column.

Note: You can find the complete documentation for the PySpark fillna() function in the official PySpark API reference.
