How to Find Duplicates in PySpark DataFrame

Duplicates in a PySpark DataFrame can be found by combining the .dropDuplicates() method with .exceptAll(). The .dropDuplicates() method returns a new DataFrame with duplicate rows removed, based on the specified columns, and .exceptAll() subtracts that deduplicated DataFrame from the original, leaving behind one copy of each duplicate row. The columns to consider can be specified as a list of column names; if none are given, all columns are used. This makes it easy to quickly identify duplicate entries in a DataFrame.

There are two common ways to find duplicate rows in a PySpark DataFrame:

Method 1: Find Duplicate Rows Across All Columns

#display rows that have duplicate values across all columns 
df.exceptAll(df.dropDuplicates()).show()

Method 2: Find Duplicate Rows Across Specific Columns

#display rows that have duplicate values across 'team' and 'position' columns
df.exceptAll(df.dropDuplicates(['team', 'position'])).show()

The following examples show how to use each method in practice with this PySpark DataFrame:

#start a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

Example 1: Find Rows with Duplicate Values Across All Columns

We can use the following syntax to find rows that have duplicate values across all columns in the DataFrame:

#display rows that have duplicate values across all columns 
df.exceptAll(df.dropDuplicates()).show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A| Forward|    22|
|   B|   Guard|    14|
+----+--------+------+

We can see that there are two rows that are exact duplicates of other rows in the DataFrame: ('A', 'Forward', 22) and ('B', 'Guard', 14). Since exceptAll() performs a multiset subtraction, a row that appears n times in the original DataFrame will appear n-1 times in the result.
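
If we also want to know how many times each duplicated row appears, one quick alternative is to group by every column and count (a minimal sketch using the standard groupBy/count pattern; the count > 1 filter is simply our threshold for "appears more than once"):

#count occurrences of each distinct row and keep only the duplicated ones
from pyspark.sql import functions as F

df.groupBy(df.columns).count().filter(F.col('count') > 1).show()

Each row in this output occurs more than once in the original DataFrame, and the count column shows exactly how many times.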

Example 2: Find Rows with Duplicate Values Across Specific Columns

We can use the following syntax to find rows that have duplicate values across the team and position columns in the DataFrame:

#display rows that have duplicate values across 'team' and 'position' columns
df.exceptAll(df.dropDuplicates(['team', 'position'])).show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|     8|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B| Forward|     7|
+----+--------+------+

The resulting DataFrame contains the extra rows that share the same values in both the team and position columns. Note that dropDuplicates() retains one row from each group, so exceptAll() returns only the remaining occurrences, not the retained copy.
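
If we instead want to display every row in each duplicated team/position group, including the copy that dropDuplicates() retains, one alternative is a window count (a sketch; the temporary column name 'n' is an arbitrary choice):

#flag rows whose ('team', 'position') combination appears more than once
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('team', 'position')
df.withColumn('n', F.count('*').over(w)).filter('n > 1').drop('n').show()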
