How do I use PySpark to perform Union and Return Distinct Rows?

PySpark can perform a union that returns only distinct rows by calling the union() method on two DataFrames and then calling the distinct() method on the resulting DataFrame to remove duplicates. Both union() and distinct() are methods of the DataFrame class (pyspark.sql.DataFrame): union() stacks the rows of two DataFrames with matching schemas, and distinct() removes any duplicate rows produced by the union.
You can use the following syntax to perform a union on two PySpark DataFrames and return only distinct rows:

df_union_distinct = df1.union(df2).distinct()

This particular example performs a union between the PySpark DataFrames named df1 and df2 and returns only the distinct rows between the two DataFrames.

The following example shows how to use this syntax in practice.

Example: How to Perform Union and Return Distinct Rows in PySpark

Suppose we have the following PySpark DataFrame named df1:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['A', 'East', 11], 
         ['B', 'East', 8], 
         ['C', 'East', 31], 
         ['D', 'West', 16], 
         ['E', 'West', 6]]
  
#define column names
columns1 = ['team', 'conference', 'points'] 
  
#create DataFrame
df1 = spark.createDataFrame(data1, columns1) 
  
#view DataFrame
df1.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   C|      East|    31|
|   D|      West|    16|
|   E|      West|     6|
+----+----------+------+

And suppose we have another DataFrame named df2:

#define data
data2 = [['A', 'East', 11], 
         ['B', 'East', 8], 
         ['G', 'East', 31], 
         ['H', 'West', 16]]

#define column names
columns2 = ['team', 'conference', 'points'] 
  
#create DataFrame
df2 = spark.createDataFrame(data2, columns2) 
  
#view DataFrame
df2.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   G|      East|    31|
|   H|      West|    16|
+----+----------+------+

Suppose we use the following syntax to perform a union between these two DataFrames:

#perform union with df1 and df2
df_union = df1.union(df2)

#view final DataFrame
df_union.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   C|      East|    31|
|   D|      West|    16|
|   E|      West|     6|
|   A|      East|    11|
|   B|      East|     8|
|   G|      East|    31|
|   H|      West|    16|
+----+----------+------+

Notice that the rows with team A and B from each DataFrame are shown in the final DataFrame, even though these rows are duplicates.

To perform a union and only return the distinct rows, we can use the following syntax:

#perform union with df1 and df2 and return only distinct rows
df_union_distinct = df1.union(df2).distinct()

#view final DataFrame
df_union_distinct.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   B|      East|     8|
|   C|      East|    31|
|   D|      West|    16|
|   E|      West|     6|
|   G|      East|    31|
|   H|      West|    16|
+----+----------+------+

Notice that we have successfully performed a union between the two DataFrames and only the distinct rows are returned.

Note: You can find the complete documentation for the PySpark union function in the official PySpark API documentation.
