To find duplicates in a pandas DataFrame, use the duplicated() method, which returns a Boolean Series indicating whether each row is a duplicate. The subset parameter restricts the check to specific columns, and the keep parameter controls which occurrence in each group of duplicates is left unflagged ('first', 'last', or False to flag all of them). To remove the duplicate rows altogether, use the drop_duplicates() method.
You can use the duplicated() function to find duplicate values in a pandas DataFrame.
This function uses the following basic syntax:
#find duplicate rows across all columns
duplicateRows = df[df.duplicated()]

#find duplicate rows across specific columns
duplicateRows = df[df.duplicated(['col1', 'col2'])]
The examples below show how to use this function in practice with the following pandas DataFrame:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [10, 10, 12, 12, 15, 17, 20, 20],
                   'assists': [5, 5, 7, 9, 12, 9, 6, 6]})

#view DataFrame
print(df)

  team  points  assists
0    A      10        5
1    A      10        5
2    A      12        7
3    A      12        9
4    B      15       12
5    B      17        9
6    B      20        6
7    B      20        6
Example 1: Find Duplicate Rows Across All Columns
The following code shows how to find duplicate rows across all of the columns of the DataFrame:
#identify duplicate rows
duplicateRows = df[df.duplicated()]
#view duplicate rows
print(duplicateRows)

  team  points  assists
1    A      10        5
7    B      20        6
There are two rows that are exact duplicates of other rows in the DataFrame.
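To make the filtering step less opaque, the short sketch below prints the Boolean Series that duplicated() returns before it is used as a mask, and counts the flagged rows. It rebuilds the same DataFrame so it can run on its own; the names mask and num_duplicates are illustrative only.

```python
import pandas as pd

# Same DataFrame as in the article
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [10, 10, 12, 12, 15, 17, 20, 20],
                   'assists': [5, 5, 7, 9, 12, 9, 6, 6]})

# Boolean Series: True where a row repeats an earlier row
mask = df.duplicated()
print(mask.tolist())
# [False, True, False, False, False, False, False, True]

# Summing the Boolean Series counts the duplicate rows
num_duplicates = int(mask.sum())
print(num_duplicates)  # 2
```

Indexing the DataFrame with this mask, as in df[mask], yields the two rows shown above.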
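By default, duplicated() uses keep='first', which leaves the first occurrence unflagged and marks the later copies. With the argument keep='last', the last occurrence is left unflagged instead, so the earlier copies are displayed: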
#identify duplicate rows
duplicateRows = df[df.duplicated(keep='last')]
#view duplicate rows
print(duplicateRows)
  team  points  assists
0    A      10        5
6    B      20        6
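The keep argument also accepts False, which flags every member of a duplicated group rather than leaving one occurrence out. The sketch below, which rebuilds the same DataFrame and uses the illustrative name allCopies, shows how this could be used to see the originals and their copies side by side.

```python
import pandas as pd

# Same DataFrame as in the article
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [10, 10, 12, 12, 15, 17, 20, 20],
                   'assists': [5, 5, 7, 9, 12, 9, 6, 6]})

# keep=False marks every row that has at least one exact copy
allCopies = df[df.duplicated(keep=False)]

# Index labels of the flagged rows
print(allCopies.index.tolist())  # [0, 1, 6, 7]
```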
Example 2: Find Duplicate Rows Across Specific Columns
The following code shows how to find duplicate rows across just the ‘team’ and ‘points’ columns of the DataFrame:
#identify duplicate rows across 'team' and 'points' columns
duplicateRows = df[df.duplicated(['team', 'points'])]
#view duplicate rows
print(duplicateRows)
  team  points  assists
1    A      10        5
3    A      12        9
7    B      20        6
There are three rows where the values for the ‘team’ and ‘points’ columns are exact duplicates of previous rows.
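Once the duplicates are identified, the drop_duplicates() method mentioned in the introduction can remove them using the same subset of columns. The following is a minimal sketch on the same DataFrame; the name deduped is illustrative, and the default keep='first' is assumed to be the desired behavior.

```python
import pandas as pd

# Same DataFrame as in the article
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [10, 10, 12, 12, 15, 17, 20, 20],
                   'assists': [5, 5, 7, 9, 12, 9, 6, 6]})

# Drop rows whose 'team' and 'points' values repeat an earlier row
# (the first occurrence of each combination is kept by default)
deduped = df.drop_duplicates(subset=['team', 'points'])

# Index labels of the rows that remain
print(deduped.index.tolist())  # [0, 2, 4, 5, 6]
```

The rows removed here (1, 3 and 7) are exactly the ones that duplicated(['team', 'points']) flagged. drop_duplicates() returns a new DataFrame and leaves df unchanged.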
Example 3: Find Duplicate Rows in One Column
The following code shows how to find duplicate rows in just the ‘team’ column of the DataFrame:
#identify duplicate rows in 'team' column
duplicateRows = df[df.duplicated(['team'])]
#view duplicate rows
print(duplicateRows)
  team  points  assists
1    A      10        5
2    A      12        7
3    A      12        9
5    B      17        9
6    B      20        6
7    B      20        6
There are six total rows where the values in the ‘team’ column are exact duplicates of previous rows.
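As a quick check on that count, the sketch below tallies the flagged rows overall and per team. It is one possible approach rather than the only one; the names teamMask and perTeam are illustrative, and the flags are converted to integers before summing so the result is a plain count.

```python
import pandas as pd

# Same DataFrame as in the article
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [10, 10, 12, 12, 15, 17, 20, 20],
                   'assists': [5, 5, 7, 9, 12, 9, 6, 6]})

# True where the 'team' value has already appeared in an earlier row
teamMask = df.duplicated(subset='team')

# Total number of flagged rows
print(int(teamMask.sum()))  # 6

# Number of flagged rows for each team
perTeam = teamMask.astype(int).groupby(df['team']).sum()
print(perTeam.to_dict())  # {'A': 3, 'B': 3}
```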