How can Linear Regression be performed in PySpark?


Linear regression is a method we can use to quantify the relationship between one or more predictor variables and a response variable.

The following step-by-step example shows how to fit a linear regression model to a dataset in PySpark.

Step 1: Create the Data

First, let’s create the following PySpark DataFrame that contains information about hours spent studying, number of prep exams taken, and final exam score for various students at some university:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[1, 1, 76],
        [2, 3, 78],
        [2, 3, 85],
        [4, 5, 88],
        [2, 2, 72],
        [1, 2, 69],
        [5, 1, 94],
        [4, 1, 94],
        [2, 0, 88],
        [4, 3, 92],
        [4, 4, 90],
        [3, 3, 75],
        [6, 2, 96],
        [5, 4, 90],
        [3, 4, 82],
        [4, 4, 85],
        [6, 5, 99],
        [2, 1, 83],
        [1, 0, 62],
        [2, 1, 76]]
  
#define column names
columns = ['hours', 'prep_exams', 'score'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view first five rows of dataframe
df.limit(5).show()

+-----+----------+-----+
|hours|prep_exams|score|
+-----+----------+-----+
|    1|         1|   76|
|    2|         3|   78|
|    2|         3|   85|
|    4|         5|   88|
|    2|         2|   72|
+-----+----------+-----+

We will fit the following multiple linear regression model to this dataset:

Exam Score = β0 + β1(hours) + β2(prep exams)

Step 2: Format the Data

Next, we’ll use the VectorAssembler function to convert the columns of the DataFrame to vectors, which is needed to fit a linear regression model in PySpark:

from pyspark.ml.feature import VectorAssembler

#specify predictor variables
assembler = VectorAssembler(inputCols=['hours', 'prep_exams'], outputCol='features')

#format DataFrame for linear regression
data = assembler.transform(df)
data = data.select('features', 'score')

Note that we specified hours and prep_exams should be used as the predictor variables in the model and score should be used as the response variable.
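
If you want to verify the transformation, you can preview the assembled DataFrame (an optional sanity check):

#optional: view the assembled features column alongside the response
data.show(5, truncate=False)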

Step 3: Fit the Linear Regression Model

Next, we’ll use the LinearRegression function to fit the linear regression model to the data:

from pyspark.ml.regression import LinearRegression

#specify linear regression model to use
lin_reg = LinearRegression(featuresCol='features',
                           labelCol='score',
                           predictionCol='pred_score')

#fit linear regression model to data
fit = lin_reg.fit(data)
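
Note that LinearRegression also accepts optional hyperparameters such as maxIter, regParam, and elasticNetParam. As a sketch, a regularized version of the model (the penalty value here is arbitrary, not tuned for this data) might look like:

#example only: ridge-style model with an arbitrary L2 penalty of 0.1
ridge_reg = LinearRegression(featuresCol='features',
                             labelCol='score',
                             predictionCol='pred_score',
                             regParam=0.1,
                             elasticNetParam=0.0)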

Step 4: View Model Summary

Lastly, we can use the following syntax to view the intercept and regression coefficients of the model along with the p-values for each coefficient and the R-squared value of the model:

#view model intercept and coefficients
print(fit.intercept, fit.coefficients)

67.67352554133275 [5.555748295250626,-0.6016868046417222]

#view p-values of intercept and coefficients
print(fit.summary.pValues)

[1.0106866433545747e-05, 0.5193352264909648, 1.4654943925052066e-14]

#view R-squared of model
print(fit.summary.r2)

0.7340272170388176

From the output we can see the fitted intercept and coefficients:

  • Intercept: 67.674
  • Coefficient of Hours: 5.556
  • Coefficient of Prep Exams: -0.602

We can use these values to write the fitted regression equation:

Exam Score = 67.674 + 5.556(hours) – 0.602(prep_exams)

We can use this equation to find the estimated exam score for a student, based on the number of hours they studied and the number of prep exams they took.

For example, a student that studies for 3 hours and takes 2 prep exams is expected to receive an exam score of 83.1:

Estimated exam score = 67.674 + 5.556*(3) – .602*(2) = 83.1 
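
We can also generate this prediction directly in PySpark by passing new data through the same VectorAssembler and calling transform() on the fitted model (a minimal sketch; new_student is just an illustrative name):

#create a DataFrame for a student who studied 3 hours and took 2 prep exams
new_student = spark.createDataFrame([[3, 2]], ['hours', 'prep_exams'])

#assemble the features column and generate the predicted score
fit.transform(assembler.transform(new_student)).select('pred_score').show()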

From the output we can also see the p-values for the coefficients and the intercept in the model (PySpark lists the coefficient p-values first, with the intercept last):

  • P-value of Hours: <.001
  • P-value of Prep Exams: 0.519
  • P-value of Intercept: <.001

Since the p-value for hours is less than .05, this indicates that hours studied has a statistically significant association with exam score.

However, the p-value for prep exams is not less than .05, which indicates that it does not have a statistically significant association with exam score.

We may decide to remove prep exams from the model since it isn’t statistically significant and instead perform simple linear regression using hours studied as the only predictor variable.
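
For example, a simple linear regression using only hours could be fit as follows (a minimal sketch; assembler_hours and simple_fit are just illustrative names):

#assemble only the hours column as the predictor
assembler_hours = VectorAssembler(inputCols=['hours'], outputCol='features')
data_hours = assembler_hours.transform(df).select('features', 'score')

#fit a simple linear regression model using hours as the only predictor
simple_fit = LinearRegression(featuresCol='features', labelCol='score').fit(data_hours)

#view the intercept and coefficient of the simple model
print(simple_fit.intercept, simple_fit.coefficients)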

Lastly, we can see the R-squared value for the model:

  • R-squared of model: 0.734

This tells us that 73.4% of the variation in exam scores can be explained by the number of hours studied and number of prep exams taken. 
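
Beyond R-squared, the training summary exposes several other metrics. For instance (attribute names per the LinearRegressionTrainingSummary API):

#view a few additional summary metrics
print(fit.summary.rootMeanSquaredError)
print(fit.summary.meanAbsoluteError)
print(fit.summary.r2adj)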

Note: Refer to the PySpark documentation for a complete list of regression summary metrics you can view.
