How can I perform a union in PySpark and return only distinct rows?

PySpark is a popular tool for processing large datasets in a distributed computing environment. One common operation in data analysis is combining multiple datasets with a union. In PySpark, the union() function appends the rows of one DataFrame to another, keeping any duplicates. To return only distinct rows, you can chain the distinct() function after the union, which removes duplicate rows from the combined result. Together, union() and distinct() let you efficiently combine DataFrames and keep only the unique rows.

PySpark: Perform Union and Return Distinct Rows


You can use the following syntax to perform a union on two PySpark DataFrames and return only distinct rows:

df_union_distinct = df1.union(df2).distinct()

This particular example performs a union between the PySpark DataFrames named df1 and df2 and returns only the distinct rows between the two DataFrames.

The following example shows how to use this syntax in practice.

Example: How to Perform Union and Return Distinct Rows in PySpark

Suppose we have the following PySpark DataFrame named df1:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['A', 'East', 11], 
         ['B', 'East', 8], 
         ['C', 'East', 31], 
         ['D', 'West', 16], 
         ['E', 'West', 6]]
  
#define column names
columns1 = ['team', 'conference', 'points'] 
  
#create DataFrame
df1 = spark.createDataFrame(data1, columns1) 
  
#view DataFrame
df1.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   C|      East|    31|
|   D|      West|    16|
|   E|      West|     6|
+----+----------+------+

And suppose we have another DataFrame named df2:

#define data
data2 = [['A', 'East', 11], 
         ['B', 'East', 8], 
         ['G', 'East', 31], 
         ['H', 'West', 16]]

#define column names
columns2 = ['team', 'conference', 'points'] 
  
#create DataFrame
df2 = spark.createDataFrame(data2, columns2) 
  
#view DataFrame
df2.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   G|      East|    31|
|   H|      West|    16|
+----+----------+------+

Suppose we use the following syntax to perform a union between these two DataFrames:

#perform union with df1 and df2
df_union = df1.union(df2)

#view final DataFrame
df_union.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   C|      East|    31|
|   D|      West|    16|
|   E|      West|     6|
|   A|      East|    11|
|   B|      East|     8|
|   G|      East|    31|
|   H|      West|    16|
+----+----------+------+

Notice that the rows with team A and B from each DataFrame are shown in the final DataFrame, even though these rows are duplicates.

To perform a union and only return the distinct rows, we can use the following syntax:

#perform union with df1 and df2 and return only distinct rows
df_union_distinct = df1.union(df2).distinct()

#view final DataFrame
df_union_distinct.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   C|      East|    31|
|   D|      West|    16|
|   E|      West|     6|
|   G|      East|    31|
|   H|      West|    16|
+----+----------+------+

Notice that we have successfully performed a union between the two DataFrames and only the distinct rows are returned.

Note: You can find the complete documentation for the PySpark union function in the official PySpark API documentation.
