Table of Contents
In order to keep the latest duplicate row while dropping the others in Pandas, you can use the drop_duplicates method. This method takes the subset of columns as an argument and returns the data frame with unique values in the specified columns. It will also keep the last row if multiple rows with the same values are present. You can also specify which row to keep by passing the keep parameter with either ‘first’ or ‘last’ value. This parameter overrides the default behavior which is to keep the first row.
You can use the following basic syntax to drop duplicates from a pandas DataFrame but keep the row with the latest timestamp:
df = df.sort_values('time').drop_duplicates(['item'], keep='last')
This particular example drops rows with duplicate values in the item column, but keeps the row with the latest timestamp in the time column.
The following example shows how to use this syntax in practice.
Example: Drop Duplicates and Keep Latest in Pandas
Suppose we have the following pandas DataFrame that contains information about the sales of various fruits at some grocery store:
import pandas as pd #create DataFrame df = pd.DataFrame({'time': ['2022-10-25 04:00:00', '2022-10-25 11:55:12', '2022-10-26 02:00:00', '2022-10-27 10:30:00', '2022-10-27 14:25:00', '2022-10-28 01:15:27'], 'item': ['apple', 'orange', 'apple', 'mango', 'mango', 'kiwi'], 'sales': [18, 22, 19, 14, 14, 11]}) #convert time column to datetime dtype df['time'] = pd.to_datetime(df['time']) #view DataFrame print(df) time item sales 0 2022-10-25 04:00:00 apple 18 1 2022-10-25 11:55:12 orange 22 2 2022-10-26 02:00:00 apple 19 3 2022-10-27 10:30:00 mango 14 4 2022-10-27 14:25:00 mango 14 5 2022-10-28 01:15:27 kiwi 11
Suppose we would like to remove all rows with duplicate values in the item column but keep the row with the latest timestamp in the time column.
We can use the following syntax to do so:
#drop duplicate rows based on value in 'item' column but keep latest timestamp df = df.sort_values('time').drop_duplicates(['item'], keep='last') #view updated DataFrame print(df) time item sales 1 2022-10-25 11:55:12 orange 22 2 2022-10-26 02:00:00 apple 19 4 2022-10-27 14:25:00 mango 14 5 2022-10-28 01:15:27 kiwi 11
Notice that the item column had multiple rows with ‘apple’ and ‘mango’ as values.
Each of these duplicate rows were removed but the row with the latest timestamp in the time column was kept.
Note: If you would like to remove rows based on duplicate values in multiple columns, simply include multiple column names in the first argument of the drop_duplicates() function.