How can we coalesce values from multiple columns into one using PySpark?


You can use the following syntax to coalesce the values from multiple columns into one in a PySpark DataFrame:

from pyspark.sql.functions import coalesce

#coalesce values from points, assists and rebounds columns
df = df.withColumn('coalesce', coalesce(df.points, df.assists, df.rebounds))

This particular example creates a new column named coalesce that coalesces the values from the points, assists and rebounds columns into one column.

Note that the coalesce function simply returns the first non-null value in each row among the columns that you specify.
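For intuition, this is the same first-non-null rule as SQL's COALESCE. A minimal plain-Python sketch of that semantics (the PySpark function itself operates on columns, not Python values):

```python
def first_non_null(*values):
    """Return the first value that is not None, mimicking SQL COALESCE."""
    for v in values:
        if v is not None:
            return v
    return None  # every input was null

# one row at a time: points, assists, rebounds
print(first_non_null(None, None, 3))    # 3
print(first_non_null(19, 7, None))      # 19
print(first_non_null(None, None, None)) # None
```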

The following example shows how to use this syntax in practice.

Example: Coalesce Values from Multiple Columns into One in PySpark

Suppose we have the following PySpark DataFrame that contains information about the points, assists and rebounds for various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[None, None, 3], 
        [None, 7, 4], 
        [19, 7, None], 
        [None, 9, None], 
        [14, None, 6]]
  
#define column names
columns = ['points', 'assists', 'rebounds'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+------+-------+--------+
|points|assists|rebounds|
+------+-------+--------+
|  null|   null|       3|
|  null|      7|       4|
|    19|      7|    null|
|  null|      9|    null|
|    14|   null|       6|
+------+-------+--------+

We can use the following syntax to create a new column named coalesce that coalesces the values from the points, assists and rebounds columns:

from pyspark.sql.functions import coalesce

#coalesce values from points, assists and rebounds columns
df = df.withColumn('coalesce', coalesce(df.points, df.assists, df.rebounds))

#view updated DataFrame
df.show()

+------+-------+--------+--------+
|points|assists|rebounds|coalesce|
+------+-------+--------+--------+
|  null|   null|       3|       3|
|  null|      7|       4|       7|
|    19|      7|    null|      19|
|  null|      9|    null|       9|
|    14|   null|       6|      14|
+------+-------+--------+--------+

Here is how the value in the coalesce column was chosen:

  • First row: The first non-null value was 3.
  • Second row: The first non-null value was 7.
  • Third row: The first non-null value was 19.
  • Fourth row: The first non-null value was 9.
  • Fifth row: The first non-null value was 14.

Note: You can find the complete documentation for the PySpark coalesce function in the official PySpark documentation.
