How can an Anti-Join be performed in PySpark?

An anti-join is a type of join operation in PySpark that returns only the rows from one DataFrame that have no match in another DataFrame on a given condition. It is performed with the DataFrame "join" method by setting the "how" parameter to "left_anti" (the aliases "leftanti" and "anti" are also accepted). Conceptually, this is equivalent to performing a left outer join and then filtering out every row that found a match in the right DataFrame: the output contains only the rows from the left DataFrame with no match in the right DataFrame, and only the left DataFrame's columns. This makes it a convenient way to filter one dataset by exclusion against another in PySpark.
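The logic described above can be sketched in plain Python (a toy illustration of the semantics only; Spark executes the join in a distributed fashion):

```python
# Toy illustration of anti-join semantics in plain Python.
# Each row is a (team, points) tuple; the join key is the team value.
left = [('A', 18), ('B', 22), ('D', 14)]
right = [('A', 99), ('C', 19)]

# Collect the join-key values present in the right dataset
right_keys = {team for team, _ in right}

# Keep only the left rows whose key has no match on the right
anti_join = [row for row in left if row[0] not in right_keys]

print(anti_join)  # [('B', 22), ('D', 14)]
```

Note that the right dataset's other column values are irrelevant: only the presence or absence of a matching key decides whether a left row is kept.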

Perform an Anti-Join in PySpark


An anti-join allows you to return all rows in one DataFrame that do not have matching values in another DataFrame.

You can use the following syntax to perform an anti-join between two PySpark DataFrames:

df_anti_join = df1.join(df2, on=['team'], how='left_anti')

This particular example performs an anti-join on the DataFrames named df1 and df2 and returns only the rows from df1 whose value in the team column does not appear in the team column of df2.
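The on parameter can also list several columns, in which case a row is excluded only when every listed column matches (e.g. on=['team', 'points']). A plain-Python sketch of that two-key logic (an illustration, not how Spark executes it):

```python
# Toy sketch: anti-join on two key columns. A left row is excluded only
# when BOTH team and points match some right row.
left = [('A', 18), ('A', 25), ('B', 22)]
right = [('A', 18), ('C', 19)]

# Each join key is a full (team, points) pair
right_keys = set(right)

anti_join = [row for row in left if row not in right_keys]

print(anti_join)  # [('A', 25), ('B', 22)]
```

Here ('A', 25) survives even though team 'A' appears in the right data, because no right row matches on both columns at once.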

The following example shows how to use this syntax in practice.

Example: How to Perform an Anti-Join in PySpark

Suppose we have the following DataFrame named df1:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['A', 18], 
         ['B', 22], 
         ['C', 19], 
         ['D', 14],
         ['E', 30]]

#define column names
columns1 = ['team', 'points'] 
  
#create dataframe using data and column names
df1 = spark.createDataFrame(data1, columns1) 
  
#view dataframe
df1.show()

+----+------+
|team|points|
+----+------+
|   A|    18|
|   B|    22|
|   C|    19|
|   D|    14|
|   E|    30|
+----+------+

And suppose we have another DataFrame named df2:

#define data
data2 = [['A', 18], 
         ['B', 22], 
         ['C', 19], 
         ['F', 22],
         ['G', 29]]

#define column names
columns2 = ['team', 'points'] 
  
#create dataframe using data and column names
df2 = spark.createDataFrame(data2, columns2) 
  
#view dataframe
df2.show()

+----+------+
|team|points|
+----+------+
|   A|    18|
|   B|    22|
|   C|    19|
|   F|    22|
|   G|    29|
+----+------+

We can use the following syntax to perform an anti-join and return all rows in the first DataFrame that do not have a matching team in the second DataFrame:

#perform anti-join
df_anti_join = df1.join(df2, on=['team'], how='left_anti')

#view resulting DataFrame
df_anti_join.show()

+----+------+
|team|points|
+----+------+
|   D|    14|
|   E|    30|
+----+------+

The resulting DataFrame contains exactly the two rows from df1 (teams D and E) whose team name does not appear in df2, so the anti-join worked as expected.

The end result is one DataFrame that only contains the rows where the team name belongs to the first DataFrame but not the second DataFrame. Note that only the columns of the first DataFrame are returned.
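As noted earlier, an anti-join is equivalent to a left join followed by filtering out the matched rows (in PySpark itself, a how='left' join and then a .filter() on a null right-side column). A plain-Python sketch of that equivalence, using the team values from the example above:

```python
# Plain-Python sketch: a left anti-join equals a left join followed by
# keeping only the rows that found no match. Illustration only.
df1_rows = [('A', 18), ('B', 22), ('C', 19), ('D', 14), ('E', 30)]
df2_teams = {'A', 'B', 'C', 'F', 'G'}

# "Left join": pair each left row with its match (None = no match found)
left_joined = [(team, points, team if team in df2_teams else None)
               for team, points in df1_rows]

# Keep only the unmatched rows, then drop the match column
anti = [(team, points) for team, points, match in left_joined
        if match is None]

print(anti)  # [('D', 14), ('E', 30)]
```

The direct how='left_anti' form is preferable in practice: it is shorter, returns only the left DataFrame's columns automatically, and lets Spark choose an efficient join strategy.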
