How can I get the rows from one PySpark DataFrame that are not present in another DataFrame?

To get the rows from one PySpark DataFrame that are not present in another DataFrame, you can use the exceptAll method. It compares two DataFrames with matching columns and returns the rows from the first DataFrame that do not appear in the second, preserving duplicates. The related subtract method does the same but returns distinct rows only. Both can be combined with other set-style methods such as intersect and union.

PySpark: Get Rows Which Are Not in Another DataFrame


You can use the following syntax to get the rows in one PySpark DataFrame which are not in another DataFrame:

df1.exceptAll(df2).show()

This particular example will return all of the rows from the DataFrame named df1 that are not in the DataFrame named df2.

The following example shows how to use this syntax in practice.

Example: Get Rows from One DataFrame that Are Not in Another DataFrame

Suppose we have the following DataFrame named df1:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['A', 18], 
         ['B', 22], 
         ['C', 19], 
         ['D', 14],
         ['E', 30]]

#define column names
columns1 = ['team', 'points'] 
  
#create dataframe using data and column names
df1 = spark.createDataFrame(data1, columns1) 
  
#view dataframe
df1.show()

+----+------+
|team|points|
+----+------+
|   A|    18|
|   B|    22|
|   C|    19|
|   D|    14|
|   E|    30|
+----+------+

And suppose we have another DataFrame named df2:

#define data
data2 = [['A', 18], 
         ['B', 22], 
         ['C', 19], 
         ['F', 22],
         ['G', 29]]

#define column names
columns2 = ['team', 'points'] 
  
#create dataframe using data and column names
df2 = spark.createDataFrame(data2, columns2) 
  
#view dataframe
df2.show()

+----+------+
|team|points|
+----+------+
|   A|    18|
|   B|    22|
|   C|    19|
|   F|    22|
|   G|    29|
+----+------+

We can use the following syntax to return all rows that exist in df1 that do not exist in df2:

#display all rows in df1 that do not exist in df2
df1.exceptAll(df2).show() 

+----+------+
|team|points|
+----+------+
|   D|    14|
|   E|    30|
+----+------+

We can see that there are exactly two rows from the first DataFrame that do not exist in the second DataFrame.

Note: The complete documentation for the PySpark exceptAll function is available in the official PySpark API reference.
