How can I calculate correlation in R with missing values?

In R, the cor() function returns NA by default when either variable contains missing values. To get a result, set its use argument so that cor() drops the incomplete observations before it calculates the correlation coefficient. Calling na.omit() on a data frame that holds both variables has the same effect as use='complete.obs', but applying na.omit() to each vector separately does not, because the remaining values no longer line up as pairs.


You can use the following methods to calculate correlation coefficients in R when one or more variables have missing values:

Method 1: Calculate Correlation Coefficient with Missing Values Present

cor(x, y, use='complete.obs')

Method 2: Calculate Correlation Matrix with Missing Values Present

cor(df, use='pairwise.complete.obs')

The following examples show how to use each method in practice.

Example 1: Calculate Correlation Coefficient with Missing Values Present

Suppose we attempt to use the cor() function to calculate the Pearson correlation coefficient between two variables when missing values are present:

#create two variables
x <- c(70, 78, 90, 87, 84, NA, 91, 74, 83, 85)
y <- c(90, NA, 79, 86, 84, 83, 88, 92, 76, 75)

#attempt to calculate correlation coefficient between x and y
cor(x, y)

[1] NA

The cor() function returns NA because we didn't specify how to handle missing values, and the default setting (use='everything') propagates any NA to the result.

To avoid this issue, we can use the argument use='complete.obs' so that R uses only the observations where both values are present:

#create two variables
x <- c(70, 78, 90, 87, 84, NA, 91, 74, 83, 85)
y <- c(90, NA, 79, 86, 84, 83, 88, 92, 76, 75)

#calculate correlation coefficient between x and y
cor(x, y, use='complete.obs')

[1] -0.4888749

The correlation coefficient between the two variables turns out to be -0.4888749.

Note that the cor() function used only the 8 of the 10 observations in which both x and y were present when calculating the correlation coefficient.
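As a rough check on what use='complete.obs' does for two vectors, the same filtering can be done by hand with complete.cases(). This is only a sketch for illustration; the names keep, r_manual and r_builtin are made up for this example and are not part of the tutorial above.

```r
#create two variables
x <- c(70, 78, 90, 87, 84, NA, 91, 74, 83, 85)
y <- c(90, NA, 79, 86, 84, 83, 88, 92, 76, 75)

#flag the observations where both x and y are present
keep <- complete.cases(x, y)

#number of observations that will be used
sum(keep)

#correlation on the manually filtered pairs
r_manual <- cor(x[keep], y[keep])

#correlation using the built-in handling of missing values
r_builtin <- cor(x, y, use='complete.obs')

#check that the two approaches agree
all.equal(r_manual, r_builtin)
```

Filtering both vectors with the same logical index keeps the pairs aligned, which is what calling na.omit() on each vector separately would fail to do.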

Example 2: Calculate Correlation Matrix with Missing Values Present

Suppose we attempt to use the cor() function to create a correlation matrix for a data frame with three variables when missing values are present:

#create data frame with some missing values
df <- data.frame(x=c(70, 78, 90, 87, 84, NA, 91, 74, 83, 85),
                 y=c(90, NA, 79, 86, 84, 83, 88, 92, 76, 75),
                 z=c(57, 57, 58, 59, 60, 78, 81, 83, NA, 90))

#attempt to create correlation matrix for variables in data frame
cor(df)

   x  y  z
x  1 NA NA
y NA  1 NA
z NA NA  1

Every off-diagonal entry is NA because each variable contains a missing value. To avoid this issue, we can use the argument use='pairwise.complete.obs' so that, for each pair of variables, R uses only the rows where both values are present:

#create data frame with some missing values
df <- data.frame(x=c(70, 78, 90, 87, 84, NA, 91, 74, 83, 85),
                 y=c(90, NA, 79, 86, 84, 83, 88, 92, 76, 75),
                 z=c(57, 57, 58, 59, 60, 78, 81, 83, NA, 90))

#create correlation matrix for variables using only pairwise complete observations
cor(df, use='pairwise.complete.obs')

           x          y          z
x  1.0000000 -0.4888749  0.1311651
y -0.4888749  1.0000000 -0.1562371
z  0.1311651 -0.1562371  1.0000000

The correlation coefficients for each pairwise combination of variables in the data frame are now shown.
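For a data frame, use='pairwise.complete.obs' and use='complete.obs' are not interchangeable: the first drops rows separately for each pair of variables, while the second drops every row that has a missing value in any column. The sketch below, which assumes the same df as above and uses illustrative names (pairwise, listwise, n_pairs), compares the two and counts how many rows each pairwise coefficient is based on.

```r
#create data frame with some missing values
df <- data.frame(x=c(70, 78, 90, 87, 84, NA, 91, 74, 83, 85),
                 y=c(90, NA, 79, 86, 84, 83, 88, 92, 76, 75),
                 z=c(57, 57, 58, 59, 60, 78, 81, 83, NA, 90))

#pairwise deletion: each coefficient uses the rows complete for that pair
pairwise <- cor(df, use='pairwise.complete.obs')

#listwise deletion: every coefficient uses only rows complete in all columns
listwise <- cor(df, use='complete.obs')

#number of rows that are complete across all three variables
n_listwise <- sum(complete.cases(df))

#number of rows available for each pair of variables
present <- (!is.na(df)) * 1
n_pairs <- crossprod(present)

pairwise
listwise
n_listwise
n_pairs
```

With pairwise deletion the coefficients in one matrix can be based on different subsets of rows, so it can be worth checking n_pairs before comparing them with one another.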
