Duplicate rows in a PySpark DataFrame can be found with the help of the .dropDuplicates() method. This method returns a new DataFrame with duplicate rows removed, optionally considering only a subset of columns passed as a list of column names. To display the duplicate rows themselves, you can subtract the deduplicated DataFrame from the original with .exceptAll(), which leaves behind the extra copies of each duplicated row.
There are two common ways to find duplicate rows in a PySpark DataFrame:
Method 1: Find Duplicate Rows Across All Columns
#display rows that have duplicate values across all columns
df.exceptAll(df.dropDuplicates()).show()
Method 2: Find Duplicate Rows Across Specific Columns
#display rows that have duplicate values across 'team' and 'position' columns
df.exceptAll(df.dropDuplicates(['team', 'position'])).show()
The following examples show how to use each method in practice with the following PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]]
#define column names
columns = ['team', 'position', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+--------+------+
|team|position|points|
+----+--------+------+
| A| Guard| 11|
| A| Guard| 8|
| A| Forward| 22|
| A| Forward| 22|
| B| Guard| 14|
| B| Guard| 14|
| B| Forward| 13|
| B| Forward| 7|
+----+--------+------+
Example 1: Find Rows with Duplicate Values Across All Columns
We can use the following syntax to find rows that have duplicate values across all columns in the DataFrame:
#display rows that have duplicate values across all columns
df.exceptAll(df.dropDuplicates()).show()
+----+--------+------+
|team|position|points|
+----+--------+------+
| A| Forward| 22|
| B| Guard| 14|
+----+--------+------+
We can see that there are two rows that are exact duplicates of other rows in the DataFrame.
Example 2: Find Rows with Duplicate Values Across Specific Columns
We can use the following syntax to find rows that have duplicate values across the team and position columns in the DataFrame:
#display rows that have duplicate values across 'team' and 'position' columns
df.exceptAll(df.dropDuplicates(['team', 'position'])).show()
+----+--------+------+
|team|position|points|
+----+--------+------+
| A| Guard| 8|
| A| Forward| 22|
| B| Guard| 14|
| B| Forward| 7|
+----+--------+------+
The resulting DataFrame contains the rows that duplicate another row's values in the team and position columns; one row from each duplicate group is excluded, since it is the copy retained by dropDuplicates().