How can I count the number of duplicate rows in PySpark?

To count the number of duplicate rows in PySpark, you can use the .groupBy() function to group the rows by all columns and then use the .count() function to count the rows in each group; any group with a count greater than 1 indicates a duplicated row. Alternatively, the .dropDuplicates() function can be used to remove the duplicate rows, and the difference between the original and deduplicated DataFrame's row counts gives the number of duplicates. Note that the two approaches answer slightly different questions: the first totals every row that belongs to a duplicated group, while the second counts only the extra copies beyond the first occurrence of each row.
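As a minimal, self-contained sketch of the second approach (the DataFrame, column names, and values below are illustrative assumptions, not part of the example that follows):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#hypothetical data: two rows are exact copies of other rows
df = spark.createDataFrame([('A', 1), ('A', 1), ('B', 2), ('B', 2), ('C', 3)],
                           ['letter', 'value'])

#number of extra copies = total row count minus deduplicated row count
print(df.count() - df.dropDuplicates().count())  #2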

Count Number of Duplicate Rows in PySpark


You can use the following syntax to count the number of duplicate rows in a PySpark DataFrame:

import pyspark.sql.functions as F

df.groupBy(df.columns)\
    .count()\
    .where(F.col('count') > 1)\
    .select(F.sum('count'))\
    .show()

The following example shows how to use this syntax in practice.

Example: Count Number of Duplicate Rows in PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

We can use the following syntax to count the number of duplicate rows in the DataFrame:

import pyspark.sql.functions as F

#count number of duplicate rows in DataFrame
df.groupBy(df.columns)\
    .count()\
    .where(F.col('count') > 1)\
    .select(F.sum('count'))\
    .show()

+----------+
|sum(count)|
+----------+
|         4|
+----------+

We can see that there are 4 total duplicate rows in the DataFrame.
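As an aside, a window function offers an alternative way to get the same total (a hedged sketch, not part of the syntax above) by flagging duplicates without collapsing the rows into groups:

from pyspark.sql import Window
import pyspark.sql.functions as F

#count how many times each exact row occurs, keeping every row
w = Window.partitionBy(*df.columns)

#count the rows whose exact contents occur more than once
df.withColumn('n', F.count(F.lit(1)).over(w))\
    .where(F.col('n') > 1)\
    .count()

This also returns 4, since it counts every row that has at least one exact copy.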

To view these duplicate rows, simply remove the last select function from the previous code:

import pyspark.sql.functions as F

#view duplicate rows in DataFrame
df.groupBy(df.columns)\
    .count()\
    .where(F.col('count') > 1)\
    .show()

+----+--------+------+-----+
|team|position|points|count|
+----+--------+------+-----+
|   A| Forward|    22|    2|
|   B|   Guard|    14|    2|
+----+--------+------+-----+

We can see that the row with team A, position Forward and points 22 occurs 2 times, and the row with team B, position Guard and points 14 also occurs 2 times.

These represent the 4 duplicate rows.
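If you instead want to list all 4 individual duplicate rows rather than one row per duplicated group, one option (a sketch; the dup_rows name is just an illustrative choice) is to semi-join the original DataFrame against the duplicated groups:

import pyspark.sql.functions as F

#distinct rows that occur more than once
dup_rows = df.groupBy(df.columns)\
    .count()\
    .where(F.col('count') > 1)\
    .drop('count')

#keep every original row that matches a duplicated row
df.join(dup_rows, on=df.columns, how='leftsemi').show()

The left semi join preserves the multiplicity of the original DataFrame, so each duplicated row appears as many times as it occurs.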

Also note that you could use the collect function instead of the show function to return only a single value that represents the number of duplicate rows:

import pyspark.sql.functions as F

#count number of duplicate rows in DataFrame
df.groupBy(df.columns)\
    .count()\
    .where(F.col('count') > 1)\
    .select(F.sum('count'))\
    .collect()[0][0]

4

Notice that this particular syntax returns only the number 4.
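One caveat: if the DataFrame contains no duplicate rows, the filtered sum is empty and collect()[0][0] returns None rather than 0. A small guard, assuming you want a plain integer either way:

import pyspark.sql.functions as F

result = df.groupBy(df.columns)\
    .count()\
    .where(F.col('count') > 1)\
    .select(F.sum('count'))\
    .collect()[0][0]

#sum(count) is null when no group has count > 1, so fall back to 0
num_duplicates = result or 0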
