Dropping duplicate rows in a pandas DataFrame is straightforward. The DataFrame.drop_duplicates() method identifies and removes duplicate rows, comparing either all columns or a chosen subset of columns. By default it returns a new DataFrame with the duplicate rows removed. The keep parameter controls which occurrence of each duplicated row, if any, is retained.
The easiest way to drop duplicate rows in a pandas DataFrame is by using the drop_duplicates() function, which uses the following syntax:
df.drop_duplicates(subset=None, keep='first', inplace=False)
where:
- subset: Which columns to consider for identifying duplicates. Default is all columns.
- keep: Indicates which duplicates (if any) to keep. Default is 'first'.
  - 'first': Delete all duplicate rows except the first occurrence.
  - 'last': Delete all duplicate rows except the last occurrence.
  - False: Delete every row that has a duplicate, keeping none of them.
- inplace: Indicates whether to drop duplicates in place or return a copy of the DataFrame.
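The difference between the two inplace settings can be sketched as follows. This is a minimal illustration, assuming a small made-up DataFrame named demo (not part of the tutorial's data); the point is only that the default call returns a new DataFrame, while inplace=True modifies the original and returns None.

```python
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame({'x': [1, 1, 2], 'y': ['p', 'p', 'q']})

# Default inplace=False: a new DataFrame is returned, demo is untouched
deduped = demo.drop_duplicates()
print(len(demo), len(deduped))

# inplace=True: demo itself is modified and the call returns None
result = demo.drop_duplicates(inplace=True)
print(result)
print(len(demo))
```

Because the in-place call returns None, assigning its result back to the same variable would discard the DataFrame, which is why the returned-copy form is often the safer choice.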
This tutorial provides several examples of how to use this function in practice on the following DataFrame:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['a', 'b', 'b', 'c', 'c', 'd'],
                   'points': [3, 7, 7, 8, 8, 9],
                   'assists': [8, 6, 7, 9, 9, 3]})

#display DataFrame
print(df)

  team  points  assists
0    a       3        8
1    b       7        6
2    b       7        7
3    c       8        9
4    c       8        9
5    d       9        3
Example 1: Remove Duplicates Across All Columns
The following code shows how to remove rows that have duplicate values across all columns:
df.drop_duplicates()
  team  points  assists
0    a       3        8
1    b       7        6
2    b       7        7
3    c       8        9
5    d       9        3
By default, the drop_duplicates() function deletes all duplicates except the first.
However, we could use the keep=False argument to delete all duplicates entirely:
df.drop_duplicates(keep=False)

  team  points  assists
0    a       3        8
1    b       7        6
2    b       7        7
5    d       9        3
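To make the effect of keep easier to see, the sketch below runs all three settings on the same data and compares the surviving index labels. It rebuilds the tutorial's DataFrame so that it can run on its own; the variable names first, last, and none are chosen here for illustration.

```python
import pandas as pd

# Same data as the tutorial's DataFrame; rows 3 and 4 are identical
df = pd.DataFrame({'team': ['a', 'b', 'b', 'c', 'c', 'd'],
                   'points': [3, 7, 7, 8, 8, 9],
                   'assists': [8, 6, 7, 9, 9, 3]})

first = df.drop_duplicates()             # default keep='first': row 4 is dropped
last = df.drop_duplicates(keep='last')   # row 3 is dropped, row 4 survives
none = df.drop_duplicates(keep=False)    # rows 3 and 4 are both dropped

print(first.index.tolist())  # [0, 1, 2, 3, 5]
print(last.index.tolist())   # [0, 1, 2, 4, 5]
print(none.index.tolist())   # [0, 1, 2, 5]
```

Rows 1 and 2 survive in every case because they differ in the assists column, so they are not duplicates when all columns are compared.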
Example 2: Remove Duplicates Across Specific Columns
The following code shows how to remove rows that have duplicate values across just the columns titled team and points:
df.drop_duplicates(subset=['team', 'points'])

  team  points  assists
0    a       3        8
1    b       7        6
3    c       8        9
5    d       9        3
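When duplicates are judged on a subset of columns, the rows being collapsed can still differ in the remaining columns, so the choice of keep decides which values survive. The sketch below is one way to inspect this, assuming the same data as above: it uses DataFrame.duplicated() to flag the repeated (team, points) pairs, then keeps the last row of each group instead of the first. The names cols, mask, and last_kept are illustrative.

```python
import pandas as pd

# Same data as the tutorial's DataFrame
df = pd.DataFrame({'team': ['a', 'b', 'b', 'c', 'c', 'd'],
                   'points': [3, 7, 7, 8, 8, 9],
                   'assists': [8, 6, 7, 9, 9, 3]})

cols = ['team', 'points']

# Boolean mask: True for rows that repeat an earlier (team, points) pair
mask = df.duplicated(subset=cols)
print(mask.tolist())  # [False, False, True, False, True, False]

# keep='last' retains the final row of each (team, points) group
last_kept = df.drop_duplicates(subset=cols, keep='last')
print(last_kept)
```

With keep='last', team b is represented by row 2 (7 assists) rather than row 1 (6 assists), so it is worth checking the flagged rows before deciding which occurrence to keep.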