How can I use the fillna() function in PySpark to fill missing values from one column with values from another column?

The fillna() function in PySpark replaces missing values in a column, but it only accepts literal replacement values (or a dictionary mapping column names to literals); it cannot pull replacements from another column. To fill the missing values in one column with the corresponding values from another column, the idiomatic approach is the coalesce() function, which returns the first non-null value among the columns passed to it. This is especially convenient when working with large datasets, since the gaps are filled in a single column expression without any manual data manipulation, improving the completeness of the data for analysis in PySpark.
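To make the distinction concrete, here is a minimal sketch. It assumes a DataFrame named df with a points column and a points_estimate column (the same layout used in the example later in this article); df_const and df_col are simply names chosen for illustration:

from pyspark.sql.functions import coalesce

#fillna() only accepts literal replacement values
df_const = df.fillna(0, subset=['points'])  #every null in 'points' becomes 0

#coalesce() takes column expressions, so it can fall back to another column
df_col = df.withColumn('points', coalesce('points', 'points_estimate'))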

PySpark: Use fillna() with Another Column


Since fillna() cannot reference another column, you can use the following syntax with coalesce() to replace null values in one column with corresponding values from another column in a PySpark DataFrame:

from pyspark.sql.functions import coalesce

df.withColumn('points', coalesce('points', 'points_estimate')).show()

This particular example replaces null values in the points column with corresponding values from the points_estimate column.
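Note that coalesce() is not limited to two columns: it accepts any number of columns and returns the first non-null value from left to right. A minimal sketch, assuming a hypothetical third column named points_backup and a literal fallback of 0:

from pyspark.sql.functions import coalesce, lit

#try 'points' first, then 'points_estimate', then the hypothetical 'points_backup',
#and finally fall back to the literal 0 if all three are null
df.withColumn(
    'points',
    coalesce('points', 'points_estimate', 'points_backup', lit(0))
).show()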

The following example shows how to use this syntax in practice.

Example: How to Use fillna() with Another Column in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18, 18], 
        ['Nets', 33, 33], 
        ['Lakers', None, 25], 
        ['Kings', 15, 15], 
        ['Hawks', None, 29],
        ['Wizards', None, 14],
        ['Magic', 28, 28]] 
  
#define column names
columns = ['team', 'points', 'points_estimate'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+---------------+
|   team|points|points_estimate|
+-------+------+---------------+
|   Mavs|    18|             18|
|   Nets|    33|             33|
| Lakers|  null|             25|
|  Kings|    15|             15|
|  Hawks|  null|             29|
|Wizards|  null|             14|
|  Magic|    28|             28|
+-------+------+---------------+

Suppose we would like to fill in all of the null values in the points column with corresponding values from the points_estimate column.

We can use the following syntax to do so:

from pyspark.sql.functions import coalesce

#replace null values in 'points' column with values from 'points_estimate' column
df.withColumn('points', coalesce('points', 'points_estimate')).show()

+-------+------+---------------+
|   team|points|points_estimate|
+-------+------+---------------+
|   Mavs|    18|             18|
|   Nets|    33|             33|
| Lakers|    25|             25|
|  Kings|    15|             15|
|  Hawks|    29|             29|
|Wizards|    14|             14|
|  Magic|    28|             28|
+-------+------+---------------+

Notice that each of the null values in the points column has been replaced with the corresponding value from the points_estimate column.
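If you want to confirm that no nulls remain, you can count them directly. A quick check using the DataFrame from this example (df_filled is just a name chosen here):

from pyspark.sql.functions import coalesce, col

#keep the filled result in a new DataFrame
df_filled = df.withColumn('points', coalesce('points', 'points_estimate'))

#count any remaining nulls in the 'points' column (prints 0 for this example)
print(df_filled.filter(col('points').isNull()).count())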

Note: You can find the complete documentation for the PySpark coalesce() function in the official PySpark documentation.
