PySpark is a powerful tool for processing large datasets in a distributed computing environment, and one common data-cleaning task is handling missing or null values. A useful strategy is to replace each null with the median of its column, since the median is robust to outliers. Using PySpark's built-in aggregation functions together with fillna(), you can compute the median of each column of interest and substitute it for the nulls, leaving a complete dataset that is ready for further analysis.
PySpark: Fill Null Values with Median
You can use the following syntax to fill null values with the column median in a PySpark DataFrame:
```python
from pyspark.sql.functions import median

#define function to fill null values with column median
def fillna_median(df, include=set()):
    medians = df.agg(*(
        median(x).alias(x) for x in df.columns if x in include
    ))
    return df.fillna(medians.first().asDict())

#fill null values with median in specific columns
df = fillna_median(df, ['points', 'assists'])
```
This particular example fills the null values in the points and assists columns of the DataFrame with their respective column medians.
The following example shows how to use this syntax in practice.
Example: How to Fill Null Values with Median in PySpark
Suppose we have the following PySpark DataFrame that contains information about various basketball players:
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4],
        ['A', 'East', 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', 6, 12],
        ['B', 'West', None, 2],
        ['C', 'East', 5, None]]

#define column names
columns = ['team', 'conference', 'points', 'assists']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|  null|      2|
|   C|      East|     5|   null|
+----+----------+------+-------+
```
Notice that both the points and assists columns have one null value.
We can use the following syntax to fill in the null values in each column with the column median:
```python
from pyspark.sql.functions import median

#define function to fill null values with column median
def fillna_median(df, include=set()):
    medians = df.agg(*(
        median(x).alias(x) for x in df.columns if x in include
    ))
    return df.fillna(medians.first().asDict())

#fill null values with median in specific columns
df = fillna_median(df, ['points', 'assists'])

#view updated DataFrame
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     8|      2|
|   C|      East|     5|      4|
+----+----------+------+-------+
```
Notice that the null values in both the points and assists columns have been replaced with their respective column medians.
For example, the null value in the points column has been replaced with 8, which represents the median value in the points column.
Similarly, the null value in the assists column has been replaced with 4, which represents the median value in the assists column.
Note: You can find the complete documentation for the PySpark fillna() function in the official PySpark documentation.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark: