How can I coalesce values from multiple columns into one column in PySpark?

Coalescing values from multiple columns into one column in PySpark means producing a single column that, for each row, contains the first non-null value found among the specified columns. This is done with the coalesce function from pyspark.sql.functions, which accepts any number of columns as input. Because it handles null values directly, coalesce is a useful tool for data cleaning and preprocessing, for example when the same measurement is spread across several sparsely populated columns.

PySpark: Coalesce Values from Multiple Columns into One


You can use the following syntax to coalesce the values from multiple columns into one in a PySpark DataFrame:

from pyspark.sql.functions import coalesce

#coalesce values from points, assists and rebounds columns
df = df.withColumn('coalesce', coalesce(df.points, df.assists, df.rebounds))

This particular example creates a new column named coalesce that coalesces the values from the points, assists and rebounds columns into one column.

Note that the coalesce function simply returns the first non-null value in each row among the columns that you specify.
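
If every column in a row is null, coalesce returns null for that row. To guarantee a non-null result, you can pass a literal as the final argument. Here is a minimal sketch using a fallback value of 0 (the column name filled is an arbitrary choice for illustration):

from pyspark.sql.functions import coalesce, lit

#fall back to 0 when points, assists and rebounds are all null
df = df.withColumn('filled', coalesce(df.points, df.assists, df.rebounds, lit(0)))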

The following example shows how to use this syntax in practice.

Example: Coalesce Values from Multiple Columns into One in PySpark

Suppose we have the following PySpark DataFrame that contains information about the points, assists and rebounds for various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[None, None, 3], 
        [None, 7, 4], 
        [19, 7, None], 
        [None, 9, None], 
        [14, None, 6]]
  
#define column names
columns = ['points', 'assists', 'rebounds'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+------+-------+--------+
|points|assists|rebounds|
+------+-------+--------+
|  null|   null|       3|
|  null|      7|       4|
|    19|      7|    null|
|  null|      9|    null|
|    14|   null|       6|
+------+-------+--------+

We can use the following syntax to create a new column named coalesce that coalesces the values from the points, assists and rebounds columns:

from pyspark.sql.functions import coalesce

#coalesce values from points, assists and rebounds columns
df = df.withColumn('coalesce', coalesce(df.points, df.assists, df.rebounds))

#view updated DataFrame
df.show()

+------+-------+--------+--------+
|points|assists|rebounds|coalesce|
+------+-------+--------+--------+
|  null|   null|       3|       3|
|  null|      7|       4|       7|
|    19|      7|    null|      19|
|  null|      9|    null|       9|
|    14|   null|       6|      14|
+------+-------+--------+--------+

Here is how the value in the coalesce column was chosen:

  • First row: The first non-null value was 3 (from the rebounds column).
  • Second row: The first non-null value was 7 (from the assists column).
  • Third row: The first non-null value was 19 (from the points column).
  • Fourth row: The first non-null value was 9 (from the assists column).
  • Fifth row: The first non-null value was 14 (from the points column).
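
For reference, the same column can also be produced with a Spark SQL expression instead of the Python function. A minimal sketch using expr, assuming the same DataFrame as above:

from pyspark.sql.functions import expr

#equivalent using a Spark SQL coalesce expression
df = df.withColumn('coalesce', expr('coalesce(points, assists, rebounds)'))
df.show()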

Note: You can find the complete documentation for the PySpark coalesce function in the official PySpark API reference.
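
Also be aware that pyspark.sql.functions.coalesce is different from the DataFrame method of the same name, which reduces the number of partitions rather than combining column values. A minimal sketch of the distinction:

from pyspark.sql.functions import coalesce

#column function: returns the first non-null value per row
col_example = coalesce(df.points, df.assists, df.rebounds)

#DataFrame method of the same name: reduces the number of partitions instead
df_one_partition = df.coalesce(1)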
