In R, the built-in duplicated() function identifies duplicate rows in a data frame. It takes a data frame as input and returns a logical vector with one element per row, where TRUE marks a row that repeats an earlier one. Negating that vector and using it to subset the data frame removes the duplicates while keeping the first occurrence of each row. The examples below show the syntax so you can apply it to your own data.
You can use one of the following two methods to remove duplicate rows from a data frame in R:
Method 1: Use Base R
#remove duplicate rows across entire data frame
df[!duplicated(df), ]

#remove duplicate rows across specific columns of data frame
df[!duplicated(df[c('var1')]), ]
Method 2: Use dplyr
#remove duplicate rows across entire data frame
df %>% distinct(.keep_all = TRUE)

#remove duplicate rows across specific columns of data frame
df %>% distinct(var1, .keep_all = TRUE)
The following examples show how to use this syntax in practice with the following data frame:
#define data frame
df <- data.frame(team=c('A', 'A', 'A', 'B', 'B', 'B'),
                 position=c('Guard', 'Guard', 'Forward', 'Guard', 'Center', 'Center'))

#view data frame
df

  team position
1    A    Guard
2    A    Guard
3    A  Forward
4    B    Guard
5    B   Center
6    B   Center
Example 1: Remove Duplicate Rows Using Base R
The following code shows how to remove duplicate rows from a data frame using functions from base R:
#remove duplicate rows from data frame
df[!duplicated(df), ]

  team position
1    A    Guard
3    A  Forward
4    B    Guard
5    B   Center
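To see why the subsetting above works, it can help to look at the logical vector itself. The sketch below rebuilds the same example data frame so it runs on its own; the names is_dup and deduped are illustrative and not part of the original tutorial.

```r
# Same example data frame as in the tutorial
df <- data.frame(team = c('A', 'A', 'A', 'B', 'B', 'B'),
                 position = c('Guard', 'Guard', 'Forward', 'Guard', 'Center', 'Center'))

# One logical value per row: TRUE marks a row that repeats an earlier row
is_dup <- duplicated(df)
is_dup
# [1] FALSE  TRUE FALSE FALSE FALSE  TRUE

# Negating the vector keeps only the first occurrence of each row
deduped <- df[!is_dup, ]

# Base R's unique() should return the same rows in a single call
all.equal(deduped, unique(df))
```

Rows 2 and 6 are flagged because they repeat rows 1 and 5, which is why the result keeps the original row names 1, 3, 4, and 5.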
The following code shows how to remove rows that have duplicate values in specific columns of a data frame using base R. Only the first row for each distinct value is kept:
#remove rows where there are duplicates in the 'team' column
df[!duplicated(df[c('team')]), ]

  team position
1    A    Guard
4    B    Guard
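By default duplicated() keeps the first row it encounters for each value. If you would rather keep the last row per team, one option is the fromLast argument of duplicated(). This is a minimal sketch on the same example data; the names first_rows and last_rows are illustrative.

```r
# Same example data frame as in the tutorial
df <- data.frame(team = c('A', 'A', 'A', 'B', 'B', 'B'),
                 position = c('Guard', 'Guard', 'Forward', 'Guard', 'Center', 'Center'))

# Default behavior: keep the first row seen for each team (rows 1 and 4)
first_rows <- df[!duplicated(df['team']), ]

# fromLast = TRUE scans from the bottom, so the last row per team is kept
last_rows <- df[!duplicated(df['team'], fromLast = TRUE), ]
last_rows
```

Here last_rows contains rows 3 and 6 (A Forward and B Center) instead of rows 1 and 4.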
Example 2: Remove Duplicate Rows Using dplyr
The following code shows how to remove duplicate rows from a data frame using the distinct() function from the dplyr package:
library(dplyr)

#remove duplicate rows from data frame
df %>% distinct(.keep_all = TRUE)

  team position
1    A    Guard
2    A  Forward
3    B    Guard
4    B   Center
Note that the .keep_all argument tells R to keep all of the columns from the original data frame.
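The effect of .keep_all is easiest to see when distinct() is given a column to check. The sketch below assumes dplyr is installed and compares the two calls on the same example data; the names team_only and all_columns are illustrative.

```r
library(dplyr)

# Same example data frame as in the tutorial
df <- data.frame(team = c('A', 'A', 'A', 'B', 'B', 'B'),
                 position = c('Guard', 'Guard', 'Forward', 'Guard', 'Center', 'Center'))

# Without .keep_all, only the column passed to distinct() is returned
team_only <- df %>% distinct(team)
names(team_only)

# With .keep_all = TRUE, the other columns of the first matching row are kept
all_columns <- df %>% distinct(team, .keep_all = TRUE)
names(all_columns)
```

When no columns are passed to distinct(), every column is used for the comparison and every column is returned, so .keep_all makes no visible difference in that case.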
The following code shows how to use the distinct() function to remove rows that have duplicate values in specific columns of a data frame:
library(dplyr)

#remove rows where there are duplicates in the 'team' column
df %>% distinct(team, .keep_all = TRUE)

  team position
1    A    Guard
2    B    Guard