How can I find duplicates in a PySpark DataFrame?

Finding duplicates in a PySpark DataFrame is usually done with the `dropDuplicates()` function. On its own, `dropDuplicates()` returns a DataFrame with duplicate rows removed, keeping the first occurrence of each row, so to display the duplicate rows themselves you can subtract the deduplicated DataFrame from the original with `exceptAll()`. You can also pass `dropDuplicates()` a subset of columns to check for duplicates in just those columns. Alternatively, the `groupBy()` and `count()` functions can be used to count how many times each value, or combination of values, appears in a column or set of columns.
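
For example, here is a minimal sketch of the `groupBy()` and `count()` approach, assuming a DataFrame named `df` with a column named `team`:

from pyspark.sql import functions as F

#count how many times each value appears in the 'team' column
#and keep only the values that appear more than once
df.groupBy('team').count().filter(F.col('count') > 1).show()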

Find Duplicates in PySpark DataFrame

There are two common ways to find duplicate rows in a PySpark DataFrame:

Method 1: Find Duplicate Rows Across All Columns

#display rows that have duplicate values across all columns 
df.exceptAll(df.dropDuplicates()).show()

Method 2: Find Duplicate Rows Across Specific Columns

#display rows that have duplicate values across 'team' and 'position' columns
df.exceptAll(df.dropDuplicates(['team', 'position'])).show()

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

Example 1: Find Rows with Duplicate Values Across All Columns

We can use the following syntax to find rows that have duplicate values across all columns in the DataFrame:

#display rows that have duplicate values across all columns 
df.exceptAll(df.dropDuplicates()).show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A| Forward|    22|
|   B|   Guard|    14|
+----+--------+------+

We can see that there are two rows that are exact duplicates of other rows in the DataFrame: the second occurrence of (A, Forward, 22) and the second occurrence of (B, Guard, 14). This works because dropDuplicates() keeps one copy of each row, so exceptAll() returns each row once for every extra occurrence.
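
If you only need to know how many rows are exact duplicates rather than display them, a simple sketch using the same df is to compare the row count before and after deduplication:

#count rows that are exact duplicates of an earlier row
print(df.count() - df.dropDuplicates().count())

2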

Example 2: Find Rows with Duplicate Values Across Specific Columns

We can use the following syntax to find rows that have duplicate values across the team and position columns in the DataFrame:

#display rows that have duplicate values across 'team' and 'position' columns
df.exceptAll(df.dropDuplicates(['team', 'position'])).show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|     8|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B| Forward|     7|
+----+--------+------+

The resulting DataFrame contains one row for each extra occurrence of a team and position combination: dropDuplicates(['team', 'position']) keeps the first row in each group, so exceptAll() returns every additional row.
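
An alternative mentioned earlier is to use groupBy() and count() to see how many times each team and position combination occurs. This sketch filters to the combinations that appear more than once:

from pyspark.sql import functions as F

#count occurrences of each team/position combination
#and keep only combinations that occur more than once
df.groupBy('team', 'position').count().filter(F.col('count') > 1).show()

+----+--------+-----+
|team|position|count|
+----+--------+-----+
|   A|   Guard|    2|
|   A| Forward|    2|
|   B|   Guard|    2|
|   B| Forward|    2|
+----+--------+-----+

In this DataFrame every combination appears exactly twice, so all four appear in the output (row order may vary).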
