How can I use the fillna() function in PySpark to replace missing values in specific columns?

The fillna() function in PySpark replaces missing (null) values in a DataFrame. Its optional subset argument lets you target specific columns with a replacement value while leaving all other columns untouched. This is particularly helpful during data preprocessing and cleaning, since unhandled nulls can distort downstream analysis.

PySpark: Use fillna() with Specific Columns


You can use the following methods with fillna() to replace null values in specific columns of a PySpark DataFrame:

Method 1: Use fillna() with One Specific Column

df.fillna(0, subset='col1').show()

Method 2: Use fillna() with Several Specific Columns

df.fillna(0, subset=['col1', 'col2']).show()

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', None, 3], 
        ['B', 'West', None, 12], 
        ['B', 'West', 6, 4], 
        ['C', None, 5, None]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|  null|      3|
|   B|      West|  null|     12|
|   B|      West|     6|      4|
|   C|      null|     5|   null|
+----+----------+------+-------+

Example 1: Use fillna() with One Specific Column

We can use the following syntax to fill the null values in only the points column with zeros:

#fill null values in 'points' column with zeros
df.fillna(0, subset='points').show() 

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|     0|      3|
|   B|      West|     0|     12|
|   B|      West|     6|      4|
|   C|      null|     5|   null|
+----+----------+------+-------+

Notice that each of the null values in the points column has been replaced with a zero, while the null values in all other columns have been left unchanged.

Example 2: Use fillna() with Several Specific Columns

We can use the following syntax to fill the null values in the points and assists columns with zeros:

#fill null values in 'points' and 'assists' columns with zeros
df.fillna(0, subset=['points', 'assists']).show() 

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|     0|      3|
|   B|      West|     0|     12|
|   B|      West|     6|      4|
|   C|      null|     5|      0|
+----+----------+------+-------+

Notice that each of the null values in the points and assists columns has been replaced with a zero, while the null values in all other columns have been left unchanged.

Note #1: We chose to replace the null values with 0 but you can use any value you’d like as a replacement.

Note #2: You can find the complete documentation for the PySpark fillna() function in the official PySpark API reference.
