How can I perform polynomial regression in R using a step-by-step approach?

Polynomial regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables when that relationship is nonlinear. To perform polynomial regression in R, a step-by-step approach can be followed. First, import the data into R and organize it into a data frame. Then fit the model with the lm() function, using the poly() function inside the formula to generate the polynomial terms for the predictor. Once the model is fit, it can be visualized using the plot() function (or ggplot2) and evaluated using the summary() function. The tutorial below walks through this process step by step.
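As a minimal sketch of that workflow (using simulated data with a known quadratic relationship, not the dataset from the tutorial below):

```r
#simulate a small dataset with a quadratic relationship
set.seed(42)
df <- data.frame(x = runif(100, 0, 10))
df$y <- 2 + 0.5 * df$x + 0.3 * df$x^2 + rnorm(100)

#fit a degree-2 polynomial regression with poly() inside the formula
model <- lm(y ~ poly(x, 2, raw = TRUE), data = df)

#evaluate the fit and visualize it
summary(model)
plot(df$x, df$y)
lines(sort(df$x), fitted(model)[order(df$x)], col = "red")
```

Here raw = TRUE makes the coefficients correspond directly to x and x²; the default (orthogonal polynomials) gives the same fitted values but less interpretable coefficients.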

Polynomial Regression in R (Step-by-Step)


Polynomial regression is a technique we can use when the relationship between a predictor variable and a response variable is nonlinear. 

This type of regression takes the form:

Y = β₀ + β₁X + β₂X² + … + βₕXʰ + ε

where h is the “degree” of the polynomial.

This tutorial provides a step-by-step example of how to perform polynomial regression in R.

Step 1: Create the Data

For this example we’ll create a dataset that contains the number of hours studied and final exam score for a class of 50 students:

#make this example reproducible
set.seed(1)

#create dataset
df <- data.frame(hours = runif(50, 5, 15), score=50)
df$score = df$score + df$hours^3/150 + df$hours*runif(50, 1, 2)

#view first six rows of data
head(df)

      hours    score
1  7.655087 64.30191
2  8.721239 70.65430
3 10.728534 73.66114
4 14.082078 86.14630
5  7.016819 59.81595
6 13.983897 83.60510

Step 2: Visualize the Data

Before we fit a regression model to the data, let’s first create a scatterplot to visualize the relationship between hours studied and exam score:

library(ggplot2)

ggplot(df, aes(x=hours, y=score)) +
  geom_point()

We can see that the data exhibits a bit of a quadratic relationship, which indicates that polynomial regression could fit the data better than simple linear regression.

Step 3: Fit the Polynomial Regression Models

Next, we’ll fit five different polynomial regression models with degrees h = 1…5 and use k-fold cross-validation with k=10 folds to calculate the test MSE for each model:

#randomly shuffle data
df.shuffled <- df[sample(nrow(df)),]

#define number of folds to use for k-fold cross-validation
K <- 10 

#define degree of polynomials to fit
degree <- 5

#create k equal-sized folds
folds <- cut(seq(1,nrow(df.shuffled)),breaks=K,labels=FALSE)

#create object to hold MSE's of models
mse = matrix(data=NA,nrow=K,ncol=degree)

#Perform K-fold cross validation
for(i in 1:K){
    
    #define training and testing data
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- df.shuffled[testIndexes, ]
    trainData <- df.shuffled[-testIndexes, ]
    
    #use k-fold cv to evaluate models
    for (j in 1:degree){
        fit.train = lm(score ~ poly(hours,j), data=trainData)
        fit.test = predict(fit.train, newdata=testData)
        mse[i,j] = mean((fit.test-testData$score)^2) 
    }
}

#find MSE for each degree 
colMeans(mse)

[1]  9.802397  8.748666  9.601865 10.592569 13.545547

From the output we can see the test MSE for each model:

  • Test MSE with degree h = 1: 9.80
  • Test MSE with degree h = 2: 8.75
  • Test MSE with degree h = 3: 9.60
  • Test MSE with degree h = 4: 10.59
  • Test MSE with degree h = 5: 13.55

The model with the lowest test MSE turned out to be the polynomial regression model with degree h = 2.

This matches our intuition from the original scatterplot: A quadratic regression model fits the data best.
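Rather than reading the minimum off the list by eye, we can also pick the best degree programmatically with which.min(). A small self-contained example using the average test MSEs from the output above:

```r
#average test MSE for each degree (from the cross-validation output above)
avg.mse <- c(9.802397, 8.748666, 9.601865, 10.592569, 13.545547)

#index of the smallest MSE = best-performing polynomial degree
best.degree <- which.min(avg.mse)
best.degree
#> [1] 2
```

In the tutorial's own code, the same result comes from which.min(colMeans(mse)).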

Step 4: Analyze the Final Model

Lastly, we can obtain the coefficients of the best performing model:

#fit best model
best = lm(score ~ poly(hours,2, raw=T), data=df)

#view summary of best model
summary(best)

Call:
lm(formula = score ~ poly(hours, 2, raw = T), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6589 -2.0770 -0.4599  2.5923  4.5122 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              54.00526    5.52855   9.768 6.78e-13 ***
poly(hours, 2, raw = T)1 -0.07904    1.15413  -0.068  0.94569    
poly(hours, 2, raw = T)2  0.18596    0.05724   3.249  0.00214 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output we can see that the final fitted model is:

Score = 54.00526 – 0.07904*(hours) + 0.18596*(hours)²

We can use this equation to estimate the score that a student will receive based on the number of hours they studied.

For example, a student who studies for 10 hours is expected to receive a score of 71.81:

Score = 54.00526 – 0.07904*(10) + 0.18596*(10)² = 71.81
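The same estimate can be obtained with predict(), which avoids transcribing coefficients by hand. The snippet below rebuilds the data and model from the steps above so it runs on its own (the value depends on the seed set in Step 1):

```r
#recreate the dataset from Step 1
set.seed(1)
df <- data.frame(hours = runif(50, 5, 15), score = 50)
df$score = df$score + df$hours^3/150 + df$hours*runif(50, 1, 2)

#refit the best model from Step 4
best = lm(score ~ poly(hours, 2, raw=T), data=df)

#predicted exam score for a student who studies 10 hours
predict(best, newdata = data.frame(hours = 10))
#> 71.81086
```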

We can also plot the fitted model to see how well it fits the raw data:

ggplot(df, aes(x=hours, y=score)) + 
          geom_point() +
          stat_smooth(method='lm', formula = y ~ poly(x,2), size = 1) + 
          xlab('Hours Studied') +
          ylab('Score')

[Plot: quadratic regression fit overlaid on the hours/score scatterplot]

