What is the PRESS Statistic?

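The PRESS statistic (predicted residual error sum of squares) measures how well a regression model predicts observations that were not used to fit it. It is calculated by leaving out each observation in turn, fitting the model to the remaining data, predicting the left-out value, and summing the squared differences between these predictions and the actual values. PRESS can be used to compare candidate models, with a lower value indicating better expected predictive accuracy. It can also help detect overfitting: a PRESS much larger than the model’s ordinary residual sum of squares suggests that the model fits the data it was trained on better than it predicts new data.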


In statistics, we fit regression models for two reasons:

(1) To explain the relationship between one or more explanatory variables and a response variable.

(2) To predict values of a response variable based on the values of one or more explanatory variables.

When our goal is to predict the values of a response variable (reason 2), we want to make sure that we’re using the best possible regression model to do so.

One metric that we can use to find the regression model that will make the best predictions on new data is the PRESS Statistic, which stands for the “Predicted REsidual Sum of Squares.”

It is calculated as:

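$$\mathrm{PRESS} = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2$$

where:

  • n: The number of observations.
  • e_i: The ith residual, i.e. the observed value minus the fitted value for observation i.
  • h_ii: A measure of the influence (also called “leverage”) of the ith observation on the model fit.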
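
To see where this formula comes from, it may help to write PRESS in terms of leave-one-out predictions. In the sketch below, $\hat{y}_{(i)}$ is notation introduced here for illustration: it denotes the prediction for observation i from a model fitted to the other n - 1 observations. For ordinary least squares regression, the leave-one-out prediction error can be recovered from the full-data residual and leverage:

```latex
\mathrm{PRESS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2,
\qquad
y_i - \hat{y}_{(i)} = \frac{e_i}{1 - h_{ii}}
```

This is why PRESS can be computed from a single model fit, without refitting the model n times.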

Given several candidate regression models fit to the same data, the one with the lowest PRESS is expected to make the most accurate predictions on a new dataset.

The following example shows how to calculate the PRESS statistic for three different linear regression models in R.

Example: Calculating the PRESS Statistic

Suppose we have a dataset with three explanatory variables, x1, x2, and x3, and one response variable y:

data <- data.frame(x1 = c(2, 3, 3, 4, 4, 6, 8, 9, 9, 9),
                   x2 = c(2, 2, 3, 3, 2, 3, 5, 6, 6, 7),
                   x3 = c(12, 14, 14, 13, 8, 8, 9, 14, 11, 7),
                    y = c(23, 24, 15, 9, 14, 17, 22, 26, 34, 35))

The following code shows how to fit three different regression models to this dataset using the lm() function:

model1 <- lm(y~x1, data=data)

model2 <- lm(y~x1+x2, data=data)

model3 <- lm(y~x2+x3, data=data)

The following code shows how to calculate the PRESS statistic for each model.

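#create custom function to calculate the PRESS statistic
PRESS <- function(model) {
    #leave-one-out residuals: e_i / (1 - h_ii), where h_ii is the leverage
    loo_resid <- residuals(model)/(1 - lm.influence(model)$hat)
    #sum of the squared leave-one-out residuals
    sum(loo_resid^2)
}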

#calculate PRESS for model 1
PRESS(model1)

[1] 590.2197

#calculate PRESS for model 2
PRESS(model2)

[1] 519.6435

#calculate PRESS for model 3
PRESS(model3)

[1] 537.7503
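
The three separate calls can also be collapsed into a single comparison. The block below is a minimal sketch of one way to do that and is not part of the original example: the names `models` and `press_values` are illustrative choices, and the data and the PRESS() function are repeated only so that the block runs on its own.

```r
# Same data as in the example
data <- data.frame(x1 = c(2, 3, 3, 4, 4, 6, 8, 9, 9, 9),
                   x2 = c(2, 2, 3, 3, 2, 3, 5, 6, 6, 7),
                   x3 = c(12, 14, 14, 13, 8, 8, 9, 14, 11, 7),
                    y = c(23, 24, 15, 9, 14, 17, 22, 26, 34, 35))

# Same PRESS function as in the example
PRESS <- function(model) {
    loo_resid <- residuals(model)/(1 - lm.influence(model)$hat)
    sum(loo_resid^2)
}

# Collect the candidate models in a named list (illustrative name)
models <- list(model1 = lm(y ~ x1, data = data),
               model2 = lm(y ~ x1 + x2, data = data),
               model3 = lm(y ~ x2 + x3, data = data))

# Compute PRESS for every model in one step
press_values <- sapply(models, PRESS)
press_values

# Name of the model with the lowest PRESS
names(which.min(press_values))
```

Keeping the models in a named list makes it easy to add further candidates without writing another pair of calls.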

It turns out that the model with the lowest PRESS statistic is model 2 with a PRESS statistic of 519.6435. Thus, we would choose this model as the one that is best suited to make predictions on a new dataset.
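
As a sanity check, the PRESS formula can be compared against a brute-force calculation that actually leaves out each observation and refits the model. The block below is a sketch under stated assumptions rather than part of the original example: `press_loo` and `press_shortcut` are hypothetical helper names, and `press_loo` assumes that the response is the first variable in the formula. It uses hatvalues(), which returns the same leverages as lm.influence(model)$hat.

```r
# Same data as in the example
data <- data.frame(x1 = c(2, 3, 3, 4, 4, 6, 8, 9, 9, 9),
                   x2 = c(2, 2, 3, 3, 2, 3, 5, 6, 6, 7),
                   x3 = c(12, 14, 14, 13, 8, 8, 9, 14, 11, 7),
                    y = c(23, 24, 15, 9, 14, 17, 22, 26, 34, 35))

# Hypothetical helper: PRESS by explicit leave-one-out refitting
press_loo <- function(formula, data) {
    # assumption: the response is the first variable in the formula
    y_name <- all.vars(formula)[1]
    sq_err <- sapply(seq_len(nrow(data)), function(i) {
        fit  <- lm(formula, data = data[-i, ])                   # fit without row i
        pred <- predict(fit, newdata = data[i, , drop = FALSE])  # predict row i
        (data[[y_name]][i] - pred)^2
    })
    sum(sq_err)
}

# Hypothetical helper: PRESS from a single fit via residuals and leverages
press_shortcut <- function(model) {
    sum((residuals(model)/(1 - hatvalues(model)))^2)
}

# Compare the two calculations for the second model
f <- y ~ x1 + x2
check <- all.equal(press_loo(f, data), press_shortcut(lm(f, data = data)))
check
```

For ordinary least squares models the two calculations should agree up to floating-point error, so the single-fit formula is usually preferred because it avoids refitting the model once per observation.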
