PySpark: Use fillna() with Another Column


You can use the following syntax with fillna() to replace null values in one column with corresponding values from another column in a PySpark DataFrame:

from pyspark.sql.functions import coalesce

df.withColumn('points', coalesce('points', 'points_estimate')).show()

This particular example replaces null values in the points column with corresponding values from the points_estimate column.

The following example shows how to use this syntax in practice.

Example: How to Use fillna() with Another Column in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18, 18], 
        ['Nets', 33, 33], 
        ['Lakers', None, 25], 
        ['Kings', 15, 15], 
        ['Hawks', None, 29],
        ['Wizards', None, 14],
        ['Magic', 28, 28]] 
  
#define column names
columns = ['team', 'points', 'points_estimate'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+---------------+
|   team|points|points_estimate|
+-------+------+---------------+
|   Mavs|    18|             18|
|   Nets|    33|             33|
| Lakers|  null|             25|
|  Kings|    15|             15|
|  Hawks|  null|             29|
|Wizards|  null|             14|
|  Magic|    28|             28|
+-------+------+---------------+

Suppose we would like to fill in all of the null values in the points column with corresponding values from the points_estimate column.

We can use the following syntax to do so:

from pyspark.sql.functions import coalesce

#replace null values in 'points' column with values from 'points_estimate' column
df.withColumn('points', coalesce('points', 'points_estimate')).show()

+-------+------+---------------+
|   team|points|points_estimate|
+-------+------+---------------+
|   Mavs|    18|             18|
|   Nets|    33|             33|
| Lakers|    25|             25|
|  Kings|    15|             15|
|  Hawks|    29|             29|
|Wizards|    14|             14|
|  Magic|    28|             28|
+-------+------+---------------+

Notice that each of the null values in the points column has been replaced with the corresponding value from the points_estimate column.

Note: You can find the complete documentation for the PySpark coalesce() function in the official PySpark documentation.
