SST, SSR, and SSE can be calculated in Python by fitting a linear regression model and then computing each sum of squares from the fitted values with NumPy. Note that the scipy.stats.linregress() function returns the slope, intercept, r-value, p-value, and standard error of a linear regression, but it does not return the sum of squares of the error (SSE), the total sum of squares (SST), or the regression sum of squares (SSR) themselves. Once these values are computed, they can be used to calculate the coefficient of determination (r-squared = SSR / SST), which indicates how well the linear regression model fits the data.
We often use three different values to measure how well a regression line fits a dataset:
1. Sum of Squares Total (SST) – The sum of squared differences between individual data points (yi) and the mean of the response variable (ȳ).
- SST = Σ(yi – ȳ)²
2. Sum of Squares Regression (SSR) – The sum of squared differences between predicted data points (ŷi) and the mean of the response variable (ȳ).
- SSR = Σ(ŷi – ȳ)²
3. Sum of Squares Error (SSE) – The sum of squared differences between predicted data points (ŷi) and observed data points (yi).
- SSE = Σ(ŷi – yi)²
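The three definitions translate almost directly into NumPy. The following is a minimal sketch rather than part of the walkthrough itself: the helper name `sums_of_squares` and the toy arrays are illustrative assumptions, and `np.polyfit` is used only to obtain a least-squares line to work with.

```python
import numpy as np

def sums_of_squares(y, y_hat):
    """Return (SST, SSR, SSE) for observed values y and fitted values y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    y_bar = y.mean()
    sst = np.sum((y - y_bar) ** 2)      # total variation around the mean
    ssr = np.sum((y_hat - y_bar) ** 2)  # variation explained by the fitted line
    sse = np.sum((y_hat - y) ** 2)      # variation left in the residuals
    return sst, ssr, sse

# toy data (hypothetical, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# least-squares line with an intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

sst, ssr, sse = sums_of_squares(y, y_hat)
print(sst, ssr, sse)
```

For a least-squares line that includes an intercept, the three values should satisfy SST = SSR + SSE up to floating-point rounding.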
The following step-by-step example shows how to calculate each of these metrics for a given regression model in Python.
Step 1: Create the Data
First, let’s create a dataset that contains the number of hours studied and exam score received for 20 different students at a certain university:
    import pandas as pd

    #create pandas DataFrame
    df = pd.DataFrame({'hours': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
                                 3, 4, 4, 4, 5, 5, 6, 7, 7, 8],
                       'score': [68, 76, 74, 80, 76, 78, 81, 84, 86, 83,
                                 88, 85, 89, 94, 93, 94, 96, 89, 92, 97]})

    #view first five rows of DataFrame
    df.head()

       hours  score
    0      1     68
    1      1     76
    2      1     74
    3      2     80
    4      2     76
Step 2: Fit a Regression Model
Next, we’ll use the OLS() function from the statsmodels library to fit a simple linear regression model using score as the response variable and hours as the predictor variable:
    import statsmodels.api as sm

    #define response variable
    y = df['score']

    #define predictor variable
    x = df[['hours']]

    #add constant to predictor variables
    x = sm.add_constant(x)

    #fit linear regression model
    model = sm.OLS(y, x).fit()
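For intuition about what this fit produces, the slope and intercept of a simple linear regression have closed forms: slope = Σ(xi – x̄)(yi – ȳ) / Σ(xi – x̄)² and intercept = ȳ – slope · x̄. The sketch below recomputes them with NumPy alone on the same data; the variable names are illustrative and it is not how statsmodels computes the fit internally.

```python
import numpy as np

# same hours and scores as in Step 1
hours = np.array([1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
                  3, 4, 4, 4, 5, 5, 6, 7, 7, 8], dtype=float)
score = np.array([68, 76, 74, 80, 76, 78, 81, 84, 86, 83,
                  88, 85, 89, 94, 93, 94, 96, 89, 92, 97], dtype=float)

# centered sums used by the closed-form least-squares solution
sxx = np.sum((hours - hours.mean()) ** 2)
sxy = np.sum((hours - hours.mean()) * (score - score.mean()))

slope = sxy / sxx
intercept = score.mean() - slope * hours.mean()

# fitted values play the role of model.fittedvalues
fitted = intercept + slope * hours
print(slope, intercept)
```

The resulting coefficients should agree with `model.params` from the statsmodels fit, since both solve the same least-squares problem.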
Step 3: Calculate SST, SSR, and SSE
Lastly, we can use the following formulas to calculate the SST, SSR, and SSE values of the model:
    import numpy as np

    #calculate sse
    sse = np.sum((model.fittedvalues - df.score)**2)
    print(sse)

    331.07488479262696

    #calculate ssr
    ssr = np.sum((model.fittedvalues - df.score.mean())**2)
    print(ssr)

    917.4751152073725

    #calculate sst
    sst = ssr + sse
    print(sst)

    1248.5499999999995
The metrics turn out to be:
- Sum of Squares Total (SST): 1248.55
- Sum of Squares Regression (SSR): 917.4751
- Sum of Squares Error (SSE): 331.0749
We can verify that SST = SSR + SSE:
- SST = SSR + SSE
- 1248.55 = 917.4751 + 331.0749
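These sums also give the coefficient of determination, R-squared = SSR / SST. The following sketch assumes SciPy is installed and uses scipy.stats.linregress() only to obtain the slope, intercept, and r-value; the sums of squares are then computed from the fitted values, and the squared r-value serves as an independent check.

```python
import numpy as np
from scipy import stats

# same hours and scores as in Step 1
hours = np.array([1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
                  3, 4, 4, 4, 5, 5, 6, 7, 7, 8], dtype=float)
score = np.array([68, 76, 74, 80, 76, 78, 81, 84, 86, 83,
                  88, 85, 89, 94, 93, 94, 96, 89, 92, 97], dtype=float)

# linregress gives the fitted line, not the sums of squares
res = stats.linregress(hours, score)
fitted = res.intercept + res.slope * hours

# each sum of squares computed from its definition
sse = np.sum((fitted - score) ** 2)
ssr = np.sum((fitted - score.mean()) ** 2)
sst = np.sum((score - score.mean()) ** 2)

r_squared = ssr / sst
print(r_squared)
print(np.isclose(sst, ssr + sse))              # SST = SSR + SSE
print(np.isclose(r_squared, res.rvalue ** 2))  # matches the squared r-value
```

With the values from this example, R-squared works out to roughly 917.4751 / 1248.55 ≈ 0.7348, meaning about 73.5% of the variation in exam scores is explained by hours studied.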