PySpark: Use fillna() with Specific Columns
You can use the following methods with fillna() to replace null values in specific columns of a PySpark DataFrame:

Method 1: Use fillna() with One Specific Column

df.fillna(0, subset='col1').show()

Method 2: Use fillna() with Several Specific Columns

df.fillna(0, subset=['col1', 'col2']).show()

The following examples show how to use each method in practice with the PySpark DataFrame below:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', None, 3], 
        ['B', 'West', None, 12], 
        ['B', 'West', 6, 4], 
        ['C', None, 5, None]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|  null|      3|
|   B|      West|  null|     12|
|   B|      West|     6|      4|
|   C|      null|     5|   null|
+----+----------+------+-------+

Example 1: Use fillna() with One Specific Column

We can use the following syntax to fill only the null values in the points column with zeros:

#fill null values in 'points' column with zeros
df.fillna(0, subset='points').show() 

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|     0|      3|
|   B|      West|     0|     12|
|   B|      West|     6|      4|
|   C|      null|     5|   null|
+----+----------+------+-------+

Notice that each of the null values in the points column has been replaced with a zero, while the null values in all other columns have been left unchanged.

Example 2: Use fillna() with Several Specific Columns

We can use the following syntax to fill the null values in the points and assists columns with zeros:

#fill null values in 'points' and 'assists' columns with zeros
df.fillna(0, subset=['points', 'assists']).show() 

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|     0|      3|
|   B|      West|     0|     12|
|   B|      West|     6|      4|
|   C|      null|     5|      0|
+----+----------+------+-------+

Notice that each of the null values in the points and assists columns has been replaced with a zero, while the null values in all other columns have been left unchanged.

Note #1: We chose to replace the null values with 0 but you can use any value you’d like as a replacement.

Note #2: You can find the complete documentation for the PySpark fillna() function in the official PySpark API documentation.
