How can I use PySpark to union DataFrames with different columns?

PySpark is a powerful tool for processing and analyzing large datasets efficiently. One useful feature is the ability to union DataFrames that have different columns, which makes it possible to combine data from multiple sources even when their schemas differ. This is done with the unionByName function, which appends the rows of one DataFrame to another and produces a new DataFrame containing every column from both sources, filling in nulls where a column is missing. This makes PySpark a valuable tool for data integration and exploration.

PySpark: Union DataFrames with Different Columns


You can use the following syntax to perform a union on two PySpark DataFrames that contain different columns:

df_union = df1.unionByName(df2, allowMissingColumns=True)

This particular example performs a union between the PySpark DataFrames named df1 and df2.

By using the argument allowMissingColumns=True, we specify that the sets of column names in the two DataFrames are allowed to differ; any column missing from one DataFrame is filled with null values.

The following example shows how to use this syntax in practice.

Example: How to Union DataFrames with Different Columns in PySpark

Suppose we have the following PySpark DataFrame named df1 that contains the columns team, conference and points:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['A', 'East', 11], 
        ['B', 'East', 8], 
        ['C', 'East', 31], 
        ['D', 'West', 16], 
        ['E', 'West', 6], 
        ['F', 'East', 5]]
  
#define column names
columns1 = ['team', 'conference', 'points'] 
  
#create DataFrame
df1 = spark.createDataFrame(data1, columns1) 
  
#view DataFrame
df1.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   C|      East|    31|
|   D|      West|    16|
|   E|      West|     6|
|   F|      East|     5|
+----+----------+------+

And suppose we have another DataFrame named df2 that contains the columns team and assists:

#define data
data2 = [['G', 4], 
        ['H', 8], 
        ['I', 11], 
        ['J', 5], 
        ['K', 2], 
        ['L', 4]]
  
#define column names
columns2 = ['team', 'assists'] 
  
#create DataFrame
df2 = spark.createDataFrame(data2, columns2) 
  
#view DataFrame
df2.show()

+----+-------+
|team|assists|
+----+-------+
|   G|      4|
|   H|      8|
|   I|     11|
|   J|      5|
|   K|      2|
|   L|      4|
+----+-------+

We can use the following syntax to perform a union on these two DataFrames:

#perform union with df1 and df2
df_union = df1.unionByName(df2, allowMissingColumns=True)

#view final DataFrame
df_union.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|   null|
|   B|      East|     8|   null|
|   C|      East|    31|   null|
|   D|      West|    16|   null|
|   E|      West|     6|   null|
|   F|      East|     5|   null|
|   G|      null|  null|      4|
|   H|      null|  null|      8|
|   I|      null|  null|     11|
|   J|      null|  null|      5|
|   K|      null|  null|      2|
|   L|      null|  null|      4|
+----+----------+------+-------+

The final DataFrame contains all of the rows from both DataFrames, and any column that exists in only one of the two DataFrames is simply filled with null values for the rows that come from the other.

Note: You can find the complete documentation for the PySpark unionByName function in the official PySpark API reference.
