How to Union DataFrames with Different Columns in PySpark?

To union two PySpark DataFrames that have different columns, you can either manually add the missing columns to each DataFrame (filled with null values) so that their schemas match before calling union(), or use the unionByName() method with allowMissingColumns=True, which handles the mismatch for you. Either way, the result combines the rows of both DataFrames, and any column that exists in only one DataFrame is filled with null values for rows that come from the other.


You can use the following syntax to perform a union on two PySpark DataFrames that contain different columns:

df_union = df1.unionByName(df2, allowMissingColumns=True)

This particular example performs a union between the PySpark DataFrames named df1 and df2.

By using the argument allowMissingColumns=True, we specify that the set of column names between the two DataFrames are allowed to differ.

The following example shows how to use this syntax in practice.

Example: How to Union DataFrames with Different Columns in PySpark

Suppose we have the following PySpark DataFrame named df1 that contains the columns team, conference and points:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['A', 'East', 11], 
        ['B', 'East', 8], 
        ['C', 'East', 31], 
        ['D', 'West', 16], 
        ['E', 'West', 6], 
        ['F', 'East', 5]]
  
#define column names
columns1 = ['team', 'conference', 'points'] 
  
#create DataFrame
df1 = spark.createDataFrame(data1, columns1) 
  
#view DataFrame
df1.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   C|      East|    31|
|   D|      West|    16|
|   E|      West|     6|
|   F|      East|     5|
+----+----------+------+

And suppose we have another DataFrame named df2 that contains the columns team and assists:

#define data
data2 = [['G', 4], 
        ['H', 8], 
        ['I', 11], 
        ['J', 5], 
        ['K', 2], 
        ['L', 4]]
  
#define column names
columns2 = ['team', 'assists'] 
  
#create DataFrame
df2 = spark.createDataFrame(data2, columns2) 
  
#view DataFrame
df2.show()

+----+-------+
|team|assists|
+----+-------+
|   G|      4|
|   H|      8|
|   I|     11|
|   J|      5|
|   K|      2|
|   L|      4|
+----+-------+

We can use the following syntax to perform a union on these two DataFrames:

#perform union with df1 and df2
df_union = df1.unionByName(df2, allowMissingColumns=True)

#view final DataFrame
df_union.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|   null|
|   B|      East|     8|   null|
|   C|      East|    31|   null|
|   D|      West|    16|   null|
|   E|      West|     6|   null|
|   F|      East|     5|   null|
|   G|      null|  null|      4|
|   H|      null|  null|      8|
|   I|      null|  null|     11|
|   J|      null|  null|      5|
|   K|      null|  null|      2|
|   L|      null|  null|      4|
+----+----------+------+-------+

The final DataFrame contains all of the rows from both DataFrames, and any column that exists in only one DataFrame is simply filled with null values for rows from the other.

Note: You can find the complete documentation for the PySpark unionByName function in the official PySpark documentation.
