Outliers are data points that fall far outside the expected range of values in a dataset. This tutorial covers three methods for identifying them in R: 1) the interquartile range (IQR) method, also known as Tukey's method, which flags any value more than 1.5 IQRs below the first quartile or above the third quartile (the same rule that boxplots use by default to draw points beyond the whiskers); 2) z-scores, which flag any value whose z-score exceeds a chosen threshold in absolute value (commonly 3); and 3) the Hampel filter, which flags any value more than 3 median absolute deviations from the median.

There are three common ways to identify outliers in a data frame in R:

Method 1: Use the Interquartile Range

We can define an observation to be an outlier if it lies more than 1.5 times the interquartile range above the third quartile (Q3) or more than 1.5 times the interquartile range below the first quartile (Q1).

#find Q1, Q3, and interquartile range for values in points column
Q1 <- quantile(df$points, .25)
Q3 <- quantile(df$points, .75)
iqr <- IQR(df$points)

#subset data where points value is outside 1.5*IQR of Q1 and Q3
outliers <- subset(df, df$points<(Q1 - 1.5*iqr) | df$points>(Q3 + 1.5*iqr))

Method 2: Use Z-Scores

We can also define an observation to be an outlier if it has a z-score less than -3 or greater than 3.

#create new column that calculates z-score of each value in points column
df$z <- (df$points-mean(df$points))/sd(df$points)

#subset data frame where z-score of points value is less than -3 or greater than 3
outliers <- df[abs(df$z)>3, ]

Method 3: Use Hampel Filter

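We can also define an observation to be an outlier if it has a value outside of the median ± 3 median absolute deviations. This is known as the Hampel filter. Note that mad() multiplies the result by 1.4826 by default, so we set constant=1 to get the raw median absolute deviation.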

#calculate low and high bounds
low <- median(df$points) - 3 * mad(df$points, constant=1)
high <- median(df$points) + 3 * mad(df$points, constant=1)

#subset dataframe where points value is outside of low and high bounds
outliers <- subset(df, df$points<low | df$points>high)

The examples below show how to use each method in practice with the following data frame in R, which contains the number of points scored by various basketball players:

#create data frame
df <- data.frame(player=LETTERS[1:15],
                 points=c(7, 12, 7, 8, 8, 10, 72, 12, 6, 6, 24, 7, 13, 4, 12))

#view data frame
df

   player points
1       A      7
2       B     12
3       C      7
4       D      8
5       E      8
6       F     10
7       G     72
8       H     12
9       I      6
10      J      6
11      K     24
12      L      7
13      M     13
14      N      4
15      O     12

Example 1: Find Outliers Using Interquartile Range

We can use the following code to identify rows with outliers in the points column based on the interquartile range method:

#find Q1, Q3, and interquartile range for values in points column
Q1 <- quantile(df$points, .25)
Q3 <- quantile(df$points, .75)
iqr <- IQR(df$points)

#subset data where points value is outside 1.5*IQR of Q1 and Q3
outliers <- subset(df, df$points<(Q1 - 1.5*iqr) | df$points>(Q3 + 1.5*iqr))

#view outliers
outliers

   player points
7       G     72
11      K     24

Using this method, we identify 2 rows as outliers in the data frame.
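
As a cross-check, base R's boxplot.stats() function applies the same 1.5 × IQR rule that boxplot() uses to draw points beyond the whiskers. The sketch below rebuilds the data frame so that it runs on its own; the name outliers_box is just an illustrative choice. Note that boxplot.stats() computes the quartiles as hinges, which can differ slightly from quantile() on other datasets, so the two approaches are not guaranteed to agree in general.

```r
#recreate data frame from above
df <- data.frame(player=LETTERS[1:15],
                 points=c(7, 12, 7, 8, 8, 10, 72, 12, 6, 6, 24, 7, 13, 4, 12))

#values that boxplot() would draw beyond the whiskers (coef=1.5 is the default)
out_values <- boxplot.stats(df$points, coef=1.5)$out

#keep rows whose points value is one of those values
outliers_box <- df[df$points %in% out_values, ]

#view outliers
outliers_box
```

For this data frame, the result should contain the same two rows (players G and K) as the manual calculation.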

Example 2: Find Outliers Using Z-Scores

We can use the following code to identify rows with outliers in the points column based on z-scores:

#create new column that calculates z-score of each value in points column
df$z <- (df$points-mean(df$points))/sd(df$points)

#subset data frame where z-score of points value is less than -3 or greater than 3
outliers <- df[abs(df$z)>3, ]

#view outliers
outliers

  player points       z
7      G     72 3.46542

Using this method, we identify 1 row as an outlier in the data frame.
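
The z-scores can also be computed with base R's scale() function, which centers a vector by its mean and divides by its sample standard deviation. The following is a minimal sketch that rebuilds the data frame so it runs on its own; the names z_scaled and z_manual are illustrative only.

```r
#recreate data frame from above
df <- data.frame(player=LETTERS[1:15],
                 points=c(7, 12, 7, 8, 8, 10, 72, 12, 6, 6, 24, 7, 13, 4, 12))

#scale() returns a one-column matrix, so convert it to a plain numeric vector
z_scaled <- as.numeric(scale(df$points))

#manual z-scores for comparison
z_manual <- (df$points - mean(df$points)) / sd(df$points)

#check that both approaches agree (up to floating-point tolerance)
all.equal(z_scaled, z_manual)

#subset data frame where z-score is less than -3 or greater than 3
df[abs(z_scaled) > 3, ]
```

Taking the absolute value checks both tails at once, so unusually low values would be flagged as well as unusually high ones.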

Example 3: Find Outliers Using Hampel Filter

We can use the following code to identify rows with outliers in the points column based on the Hampel Filter:

#calculate low and high bounds
low <- median(df$points) - 3 * mad(df$points, constant=1)
high <- median(df$points) + 3 * mad(df$points, constant=1)

#subset dataframe where points value is outside of low and high bounds
outliers <- subset(df, df$points<low | df$points>high)

#view outliers
outliers

   player points
7       G     72
11      K     24

Using this method, we identify 2 rows as outliers in the data frame.
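
To compare the three methods side by side, one option is to wrap each rule in a small function that returns TRUE for flagged values. The sketch below is only one way to organize this: the function names (is_outlier_iqr, is_outlier_z, is_outlier_hampel) and the flag column names are hypothetical and not part of base R, and the cutoffs are the same ones used in the examples above.

```r
#recreate data frame from above
df <- data.frame(player=LETTERS[1:15],
                 points=c(7, 12, 7, 8, 8, 10, 72, 12, 6, 6, 24, 7, 13, 4, 12))

#hypothetical helper: TRUE if value is more than k*IQR below Q1 or above Q3
is_outlier_iqr <- function(x, k=1.5) {
  q <- quantile(x, c(.25, .75), names=FALSE)
  iqr <- q[2] - q[1]
  x < (q[1] - k*iqr) | x > (q[2] + k*iqr)
}

#hypothetical helper: TRUE if absolute z-score exceeds cutoff
is_outlier_z <- function(x, cutoff=3) {
  abs((x - mean(x)) / sd(x)) > cutoff
}

#hypothetical helper: TRUE if value is more than k raw MADs from the median
is_outlier_hampel <- function(x, k=3) {
  abs(x - median(x)) > k * mad(x, constant=1)
}

#add one logical column per method
df$iqr_flag <- is_outlier_iqr(df$points)
df$z_flag <- is_outlier_z(df$points)
df$hampel_flag <- is_outlier_hampel(df$points)

#view rows flagged by at least one method
df[df$iqr_flag | df$z_flag | df$hampel_flag, ]
```

Consistent with the examples above, player G is flagged by all three methods, while player K is flagged by the interquartile range and Hampel methods but not by z-scores. One likely reason is that the extreme value of 72 inflates both the mean and the standard deviation, which makes the z-score rule less sensitive to the second-largest value.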
