How can Grubbs’ test be performed in R?

Grubbs’ test is a statistical method used to detect outliers in a dataset. It is commonly used in data analysis to check whether an extreme value is consistent with the rest of the sample before results are reported. To perform Grubbs’ test in R, follow these steps:

1. Load the necessary packages: The first step is to load the “outliers” package in R, which contains the function for performing Grubbs’ test.

2. Import the dataset: The dataset to be analyzed should be imported into R using the appropriate function, such as “read.csv()”.

3. Identify the variable: Specify the column or variable in the dataset that is to be tested for outliers.

4. Perform the test: Using the “grubbs.test()” function from the “outliers” package, Grubbs’ test can be performed on the chosen variable. The output reports the test statistic G, the variance ratio U, and the p-value.

5. Identify outliers: Compare the p-value with the chosen significance level (commonly 0.05). If the p-value is smaller, the null hypothesis of no outlier is rejected and the most extreme value is flagged as an outlier. Equivalently, G can be compared with a critical value for the given sample size, but grubbs.test() does not print that critical value.

6. Remove outliers: If necessary, the outliers can be removed from the dataset using the “subset()” function in R.

By following these steps, Grubbs’ test can be easily performed in R, providing a reliable and efficient way to detect and handle outliers in a dataset.
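
The six steps above can be sketched end to end as follows. This is a minimal illustration rather than a definitive workflow: the data frame df, its column score, and the commented-out file name are hypothetical, the 0.05 significance level is an assumption, and the outliers package is assumed to be installed already.

```r
# Step 1: load the package (assumes install.packages("outliers") was run once)
library(outliers)

# Step 2: in practice the data would be imported from a file, for example:
# df <- read.csv("measurements.csv")   # hypothetical file name
# A small inline data frame keeps this sketch self-contained.
df <- data.frame(score = c(5, 14, 15, 15, 14, 13, 19, 17, 16, 20,
                           22, 8, 21, 28, 11, 9, 29, 40))

# Step 3: pick the variable to test
x <- df$score

# Step 4: run Grubbs' test on that variable
result <- grubbs.test(x)
print(result)

# Step 5: compare the p-value with the significance level (assumed 0.05)
alpha <- 0.05
outlier_found <- result$p.value < alpha

# Step 6: if an outlier was flagged, drop the value farthest from the mean
if (outlier_found) {
  suspect <- x[which.max(abs(x - mean(x)))]
  df_clean <- subset(df, score != suspect)
} else {
  df_clean <- df
}
```

Note that subset() removes every row equal to the suspect value, so a dataset with repeated extreme values would lose all of them at once.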

Perform Grubbs’ Test in R


Grubbs’ Test is a statistical test that can be used to identify the presence of outliers in a dataset. To use this test, a dataset should be approximately normally distributed and have at least 7 observations.

This tutorial explains how to perform Grubbs’ Test in R to detect outliers in a dataset.

Example: Grubbs’ Test in R

To perform Grubbs’ Test in R, we can use the grubbs.test() function from the outliers package, which uses the following syntax:

grubbs.test(x, type = 10, opposite = FALSE, two.sided = FALSE)

where:

  • x: a numeric vector of data values
  • type: 10 = test for one outlier (the value farthest from the mean, on either tail), 11 = test if both the min and max values are outliers (two outliers on opposite tails), 20 = test if there are two outliers on one tail
  • opposite: logical indicating whether to test the value at the opposite end from the most suspicious one (the lowest value if the most suspicious is the highest, and vice versa)
  • two.sided: logical value indicating whether the test should be treated as two-sided

This test uses the following two hypotheses:

H0 (null hypothesis): There is no outlier in the data.

HA (alternative hypothesis): There is an outlier in the data.

The following example illustrates how to perform Grubbs’ Test to determine if the max value in a dataset is an outlier:

#load outliers package (package names are case-sensitive in R)
library(outliers)

#create data
data <- c(5, 14, 15, 15, 14, 13, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40)

#perform Grubbs' Test to see if '40' is an outlier
grubbs.test(data)

#	Grubbs test for one outlier
#
#data:  data
#G = 2.65990, U = 0.55935, p-value = 0.02398
#alternative hypothesis: highest value 40 is an outlier

The test statistic of the test is G = 2.65990 and the corresponding p-value is p = 0.02398. Since this value is less than 0.05, we will reject the null hypothesis and conclude that the max value of 40 is an outlier.
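
To see where the numbers in this output come from, the two statistics can be reproduced by hand in base R. This is a sketch for illustration only: G is the distance of the maximum from the sample mean in units of the sample standard deviation, and U is the ratio of the sum of squared deviations without the suspect value to the sum of squared deviations of the full sample. The helper ss() is a hypothetical name introduced here.

```r
data <- c(5, 14, 15, 15, 14, 13, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40)

# G: how many sample standard deviations the maximum lies above the mean
G <- (max(data) - mean(data)) / sd(data)

# ss(): sum of squared deviations from the mean (hypothetical helper)
ss <- function(v) sum((v - mean(v))^2)

# U: squared deviations without the suspect value relative to the full sample
U <- ss(data[-which.max(data)]) / ss(data)

round(c(G = G, U = U), 5)
```

Both values agree with the G and U reported by grubbs.test() above, up to rounding.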

If we instead wanted to test whether the lowest value of ‘5’ was an outlier, we could use the opposite=TRUE argument:

#perform Grubbs' Test to see if '5' is an outlier
grubbs.test(data, opposite=TRUE)

#	Grubbs test for one outlier
#
#data:  data
#G = 1.4879, U = 0.8621, p-value = 1
#alternative hypothesis: lowest value 5 is an outlier

The test statistic is G = 1.4879 and the corresponding p-value is p = 1. Since this value is not less than 0.05, we fail to reject the null hypothesis. We do not have sufficient evidence to say that the minimum value of ‘5’ is an outlier.
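
The two.sided argument listed in the syntax is not used in the examples here, so the following sketch shows one way it might be applied. It assumes the outliers package is installed and relies on grubbs.test() returning a standard test object whose p-value can be read from the p.value element. The variable names one_sided and two_sided are chosen only for illustration.

```r
library(outliers)

data <- c(5, 14, 15, 15, 14, 13, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40)

# Default: one-sided test on the value farthest from the mean
one_sided <- grubbs.test(data)

# Same test, treated as two-sided
two_sided <- grubbs.test(data, two.sided = TRUE)

# Extract the p-values programmatically instead of reading printed output
p_values <- c(one_sided = one_sided$p.value, two_sided = two_sided$p.value)
print(p_values)
```

A two-sided test is the more conservative choice when there was no reason in advance to suspect one particular tail.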

Lastly, suppose we had two large values at one end of the dataset: 40 and 42. To test if both of these values are outliers, we could perform Grubbs’ Test and specify that type=20:

#create dataset with two large values at one end: 40 and 42
data <- c(5, 14, 15, 15, 14, 13, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40, 42) 

#perform Grubbs' Test to see if both 40 and 42 are outliers
grubbs.test(data, type=20)

#	Grubbs test for two outliers
#
#data:  data
#U = 0.38111, p-value = 0.01195
#alternative hypothesis: highest values 40 , 42 are outliers

The p-value of the test is 0.01195. Since this is less than 0.05, we can reject the null hypothesis and conclude that we have sufficient evidence to say the values 40 and 42 are both outliers.
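
For the two-outlier variant, the U statistic in the output can also be checked by hand in base R. The sketch below is only an illustration of that ratio: the sum of squared deviations after dropping the two largest values, divided by the sum of squared deviations of the full sample. The helper ss() is a hypothetical name.

```r
data <- c(5, 14, 15, 15, 14, 13, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40, 42)

# ss(): sum of squared deviations from the mean (hypothetical helper)
ss <- function(v) sum((v - mean(v))^2)

# Drop the two largest values (40 and 42) and compare the spread
trimmed <- sort(data)[seq_len(length(data) - 2)]
U <- ss(trimmed) / ss(data)

round(U, 5)
```

A small U means that removing the two values shrinks the spread of the data considerably, which is what the test treats as evidence that both are outliers.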

What to Do if an Outlier is Identified

If Grubbs’ Test does identify an outlier in your dataset, you have a few options:

1. Double check to make sure that the value is not a typo or a data entry error. Occasionally, values that show up as outliers in datasets are simply typos made by an individual when entering the data. Go back and verify that the value was entered correctly before you make any further decisions.

2. Assign a new value to the outlier. If the outlier turns out to be a result of a typo or data entry error, you may decide to assign a new value to it, such as the mean or median of the dataset.

3. Remove the outlier. If the value is a true outlier, you may choose to remove it if it will have a significant impact on your overall analysis.
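
Options 2 and 3 can be sketched in base R as follows. The median of the remaining values is used here as the replacement, which is an assumption of this sketch rather than a recommendation. The variable names suspect, replaced, and removed are hypothetical.

```r
data <- c(5, 14, 15, 15, 14, 13, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40)

# Position of the flagged value (here the maximum, 40)
suspect <- which.max(data)

# Option 2: replace the flagged value with the median of the other values
replaced <- data
replaced[suspect] <- median(data[-suspect])

# Option 3: remove the flagged value entirely
removed <- data[-suspect]
```

Keeping the original vector untouched and working on copies makes it easy to compare the analysis with and without the flagged value.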
