How do you vertically concatenate DataFrames in PySpark?

In PySpark, vertically concatenating DataFrames means stacking two or more DataFrames with the same columns on top of each other to form a single DataFrame. This is done with the union (or unionAll) function, which appends the rows of one DataFrame below the rows of another. This is useful for merging multiple datasets that share the same schema, so the combined data can be analyzed and manipulated as one.


You can use the following syntax to vertically concatenate multiple PySpark DataFrames:

from functools import reduce
from pyspark.sql import DataFrame

#specify DataFrames to concatenate
df_list = [df1, df2, df3]

#vertically concatenate all DataFrames in list
df_all = reduce(DataFrame.unionAll, df_list)

This particular example uses the reduce function along with the unionAll function to vertically concatenate the DataFrames named df1, df2, and df3 into one DataFrame called df_all. Note that all of the DataFrames must have the same number of columns, and the columns are matched by position.
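To see what reduce is doing here, note that reduce(f, [a, b, c]) computes f(f(a, b), c), so the call above is equivalent to df1.unionAll(df2).unionAll(df3). A pure-Python sketch of the same folding pattern, using list concatenation as a stand-in for unionAll:

```python
from functools import reduce

# plain lists of (team, points) tuples as stand-ins for DataFrames
rows1 = [("Mavs", 18), ("Nets", 33)]
rows2 = [("Kings", 15)]
rows3 = [("Celtics", 25)]

def concat(a, b):
    """Stand-in for DataFrame.unionAll: stack b's rows below a's."""
    return a + b

# reduce folds the list pairwise: concat(concat(rows1, rows2), rows3)
all_rows = reduce(concat, [rows1, rows2, rows3])
print(all_rows)
# [('Mavs', 18), ('Nets', 33), ('Kings', 15), ('Celtics', 25)]
```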

The following example shows how to use this syntax in practice.

Example: How to Vertically Concatenate DataFrames in PySpark

Suppose we have three PySpark DataFrames that each contain information about points scored by basketball players on various teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12]]

data2 = [['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28]]

data3 = [['Celtics', 25], 
        ['Spurs', 29],
        ['Rockets', 14],
        ['Heat', 30]] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframes using data and column names
df1 = spark.createDataFrame(data1, columns) 
df2 = spark.createDataFrame(data2, columns)
df3 = spark.createDataFrame(data3, columns)
  
#view dataframes
df1.show()
df2.show()
df3.show()

+------+------+
|  team|points|
+------+------+
|  Mavs|    18|
|  Nets|    33|
|Lakers|    12|
+------+------+

+-------+------+
|   team|points|
+-------+------+
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
+-------+------+

+-------+------+
|   team|points|
+-------+------+
|Celtics|    25|
|  Spurs|    29|
|Rockets|    14|
|   Heat|    30|
+-------+------+

Suppose we would like to vertically concatenate each of the three DataFrames into one DataFrame.

We can use the following syntax to do so:

from functools import reduce
from pyspark.sql import DataFrame

#specify DataFrames to concatenate
df_list = [df1, df2, df3]

#vertically concatenate all DataFrames in list
df_all = reduce(DataFrame.unionAll, df_list)

#view resulting DataFrame
df_all.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|Celtics|    25|
|  Spurs|    29|
|Rockets|    14|
|   Heat|    30|
+-------+------+

The new DataFrame named df_all contains the data from all three DataFrames concatenated vertically.

Note: In Spark 2.0 and later, unionAll is simply an alias for union. You can find the complete documentation for the PySpark union function in the PySpark API reference.
