Outliers are data points that deviate significantly from the overall pattern of a dataset and can distort analysis results. R offers several ways to identify them. One is the boxplot function, which displays the distribution of the data and marks any values that fall beyond the whiskers. Another is the Z-score, which measures how many standard deviations a data point lies from the mean; points whose absolute Z-score exceeds a chosen threshold (typically 3) can be considered outliers. In a regression setting, Cook’s distance can be used to identify influential data points that have a large impact on the fitted model. These methods help researchers find and handle outliers so that their analysis results are accurate and reliable.
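The box plot and Cook’s distance approaches mentioned above are not covered in the examples further down, so here is a minimal sketch of both in base R. The `minutes` predictor is made up purely for illustration, and the 4/n cutoff for Cook’s distance is a common rule of thumb rather than a fixed standard.

```r
# points scored by 15 players (same values as the example data frame below)
points <- c(7, 12, 7, 8, 8, 10, 72, 12, 6, 6, 24, 7, 13, 4, 12)

# boxplot.stats() returns the values beyond the whiskers in $out
# (by default, more than 1.5 * IQR beyond the hinges)
box_outliers <- boxplot.stats(points)$out

# Cook's distance needs a fitted regression model;
# 'minutes' is a hypothetical predictor used only to have something to fit
minutes <- seq_along(points)
model <- lm(points ~ minutes)
cooks <- cooks.distance(model)

# rule-of-thumb cutoff (an assumption, not a standard): 4 / n
influential <- which(cooks > 4 / length(points))
```

Here `box_outliers` holds the flagged values themselves, while `influential` holds the positions of the flagged observations.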
Find Outliers in R (3 Methods)
There are three common ways to identify outliers in a data frame in R:
Method 1: Use the Interquartile Range
We can define an observation to be an outlier if it lies more than 1.5 times the interquartile range above the third quartile (Q3) or more than 1.5 times the interquartile range below the first quartile (Q1).
#find Q1, Q3, and interquartile range for values in points column
Q1 <- quantile(df$points, .25)
Q3 <- quantile(df$points, .75)
IQR <- IQR(df$points)

#subset data where points value is outside 1.5*IQR of Q1 and Q3
outliers <- subset(df, df$points<(Q1 - 1.5*IQR) | df$points>(Q3 + 1.5*IQR))
Method 2: Use Z-Scores
We can also define an observation to be an outlier if it has a z-score less than -3 or greater than 3.
#create new column that calculates z-score of each value in points column
df$z <- (df$points-mean(df$points))/sd(df$points)

#subset data frame where z-score of points value is less than -3 or greater than 3
outliers <- df[abs(df$z)>3, ]
Method 3: Use Hampel Filter
We can also define an observation to be an outlier if it has a value outside of the median ± 3 median absolute deviations. This is known as the Hampel Filter.
#calculate low and high bounds
low <- median(df$points) - 3 * mad(df$points, constant=1)
high <- median(df$points) + 3 * mad(df$points, constant=1)

#subset dataframe where points value is outside of low and high bounds
outliers <- subset(df, df$points<low | df$points>high)
The following examples show how to use each method in practice with a data frame in R that contains the number of points scored by various basketball players:
#create data frame
df <- data.frame(player=LETTERS[1:15],
                 points=c(7, 12, 7, 8, 8, 10, 72, 12, 6, 6, 24, 7, 13, 4, 12))

#view data frame
df

   player points
1       A      7
2       B     12
3       C      7
4       D      8
5       E      8
6       F     10
7       G     72
8       H     12
9       I      6
10      J      6
11      K     24
12      L      7
13      M     13
14      N      4
15      O     12
Example 1: Find Outliers Using Interquartile Range
We can use the following code to identify rows with outliers in the points column based on the interquartile range method:
#find Q1, Q3, and interquartile range for values in points column
Q1 <- quantile(df$points, .25)
Q3 <- quantile(df$points, .75)
IQR <- IQR(df$points)

#subset data where points value is outside 1.5*IQR of Q1 and Q3
outliers <- subset(df, df$points<(Q1 - 1.5*IQR) | df$points>(Q3 + 1.5*IQR))

#view outliers
outliers

   player points
7       G     72
11      K     24
Using this method, we identify 2 rows as outliers in the data frame.
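If the same check is needed for several columns, the steps above can be wrapped in a small helper. This is only a sketch: `iqr_outliers()` and its `k` argument are hypothetical names, not part of base R, and the data frame is recreated here so the block runs on its own.

```r
# hypothetical helper: TRUE where a value lies more than k * IQR
# below Q1 or above Q3
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(.25, .75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  x < (q[1] - fence) | x > (q[2] + fence)
}

# same example data as above
df <- data.frame(player=LETTERS[1:15],
                 points=c(7, 12, 7, 8, 8, 10, 72, 12, 6, 6, 24, 7, 13, 4, 12))

# keep only the flagged rows
iqr_result <- df[iqr_outliers(df$points), ]
```

Returning a logical vector instead of a subset keeps the helper usable both for filtering rows and for adding a flag column to the data frame.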
Example 2: Find Outliers Using Z-Scores
We can use the following code to identify rows with outliers in the points column based on the z-score method:
#create new column that calculates z-score of each value in points column
df$z <- (df$points-mean(df$points))/sd(df$points)

#subset data frame where z-score of points value is less than -3 or greater than 3
outliers <- df[abs(df$z)>3, ]

#view outliers
outliers

   player points       z
7       G     72 3.46542
Using this method, we identify 1 row as an outlier in the data frame.
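One caveat with the z-score rule is that an extreme value inflates the mean and standard deviation used to compute the scores, which can hide more moderate outliers. The sketch below recreates the example data and shows that player K (24 points), who is flagged by the interquartile range method in Example 1 and by the Hampel Filter in Example 3, gets a z-score well below 3. It uses scale(), which by default centers by the mean and divides by the standard deviation; the variable name `z_K` is just for illustration.

```r
# same example data as above
df <- data.frame(player=LETTERS[1:15],
                 points=c(7, 12, 7, 8, 8, 10, 72, 12, 6, 6, 24, 7, 13, 4, 12))

# scale() returns a one-column matrix, so convert it back to a plain vector
df$z <- as.numeric(scale(df$points))

# flag on the absolute value so both tails are checked
outliers <- df[abs(df$z) > 3, ]

# z-score of player K, which stays far below the cutoff of 3
z_K <- df$z[df$player == "K"]
```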
Example 3: Find Outliers Using Hampel Filter
We can use the following code to identify rows with outliers in the points column based on the Hampel Filter:
#calculate low and high bounds
low <- median(df$points) - 3 * mad(df$points, constant=1)
high <- median(df$points) + 3 * mad(df$points, constant=1)

#subset dataframe where points value is outside of low and high bounds
outliers <- subset(df, df$points<low | df$points>high)

#view outliers
outliers

   player points
7       G     72
11      K     24
Using this method, we identify 2 rows as outliers in the data frame.
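The `constant=1` argument matters here. By default, mad() multiplies the raw median absolute deviation by 1.4826 so that it estimates the standard deviation for normally distributed data, which widens the bounds. A quick sketch of the difference on the same points values (the variable names are illustrative):

```r
# same points values as in the example data frame
points <- c(7, 12, 7, 8, 8, 10, 72, 12, 6, 6, 24, 7, 13, 4, 12)

raw_mad <- mad(points, constant=1)   # raw median absolute deviation
scaled_mad <- mad(points)            # default constant = 1.4826

# lower and upper bounds at median +/- 3 MADs under each choice
bounds_raw <- median(points) + c(-3, 3) * raw_mad
bounds_scaled <- median(points) + c(-3, 3) * scaled_mad

# values flagged under each choice
flag_raw <- points[points < bounds_raw[1] | points > bounds_raw[2]]
flag_scaled <- points[points < bounds_scaled[1] | points > bounds_scaled[2]]
```

For this data both choices flag the same two values, but on other data the wider default bounds can flag fewer points, so it is worth stating which constant was used.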