How to Calculate Partial Correlation in Python?

Partial correlation in Python can be calculated with the partial_corr() function from the pingouin package, which measures the correlation between two variables while controlling for the effect of one or more other variables in the dataset.


In statistics, we often use the Pearson correlation coefficient to measure the linear relationship between two variables. However, sometimes we’re interested in understanding the relationship between two variables while controlling for a third variable.

For example, suppose we want to measure the association between the number of hours a student studies and the final exam score they receive, while controlling for the student’s current grade in the class. In this case, we could use a partial correlation to measure the relationship between hours studied and final exam score.

This tutorial explains how to calculate partial correlation in Python.

Example: Partial Correlation in Python

Suppose we have the following Pandas DataFrame that displays the current grade, total hours studied, and final exam score for 10 students:

import numpy as np
import pandas as pd

data = {'currentGrade':  [82, 88, 75, 74, 93, 97, 83, 90, 90, 80],
        'hours': [4, 3, 6, 5, 4, 5, 8, 7, 4, 6],
        'examScore': [88, 85, 76, 70, 92, 94, 89, 85, 90, 93],
        }

df = pd.DataFrame(data, columns = ['currentGrade','hours', 'examScore'])
df

   currentGrade  hours  examScore
0            82      4         88
1            88      3         85
2            75      6         76
3            74      5         70
4            93      4         92
5            97      5         94
6            83      8         89
7            90      7         85
8            90      4         90
9            80      6         93

To calculate the partial correlation between hours and examScore while controlling for currentGrade, we can use the partial_corr() function from the pingouin package, which uses the following syntax:

partial_corr(data, x, y, covar)

where:

  • data: name of the dataframe
  • x, y: names of columns in the dataframe
  • covar: the name of the covariate column in the dataframe, or a list of names (i.e. the variable or variables you’re controlling for)

Here is how to use this function in this particular example:

#install the pingouin package first (run this in a terminal, not inside Python):
#pip install pingouin

#import the pingouin package
import pingouin as pg

#find partial correlation between hours and exam score while controlling for grade
pg.partial_corr(data=df, x='hours', y='examScore', covar='currentGrade')


          n      r         CI95%     r2  adj_r2  p-val   BF10  power
pearson  10  0.191  [-0.5, 0.73]  0.036  -0.238  0.598  0.438  0.082

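We can see that the partial correlation between hours studied and final exam score is 0.191, which is a small positive correlation. As hours studied increases, exam score tends to increase as well, assuming current grade is held constant. Note, however, that the p-value of 0.598 and the wide 95% confidence interval of [-0.5, 0.73] mean this correlation is not statistically significant for a sample of only 10 students.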
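To make "held constant" more concrete, the sketch below recomputes the same value with NumPy alone, in two ways: by correlating the residuals left over after regressing each variable on current grade, and by the standard first-order partial correlation formula built from the three pairwise Pearson correlations. The helper names (residuals, partial_corr_residuals, partial_corr_formula) are hypothetical and chosen for illustration; this is not necessarily how pingouin computes the value internally.

```python
import numpy as np

# Same values as the DataFrame above, stored as plain NumPy arrays
current_grade = np.array([82, 88, 75, 74, 93, 97, 83, 90, 90, 80], dtype=float)
hours = np.array([4, 3, 6, 5, 4, 5, 8, 7, 4, 6], dtype=float)
exam_score = np.array([88, 85, 76, 70, 92, 94, 89, 85, 90, 93], dtype=float)


def residuals(target, covariate):
    """Residuals from a simple least-squares line of target on covariate."""
    slope, intercept = np.polyfit(covariate, target, deg=1)
    return target - (slope * covariate + intercept)


def partial_corr_residuals(x, y, z):
    """Hypothetical helper: correlate what remains of x and y once z is removed."""
    return np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]


def partial_corr_formula(x, y, z):
    """Hypothetical helper: first-order partial correlation from pairwise r values."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))


r_resid = partial_corr_residuals(hours, exam_score, current_grade)
r_formula = partial_corr_formula(hours, exam_score, current_grade)

print(f"{r_resid:.3f}")    # 0.191
print(f"{r_formula:.3f}")  # 0.191
```

Both approaches agree with the r value reported by partial_corr() above. The residual version shows the idea most directly: whatever part of hours and examScore can be predicted from currentGrade is removed before the two are correlated.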

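To calculate the partial correlation between every pair of variables at once, we can use the .pcorr() method that pingouin adds to pandas DataFrames. Each pairwise value controls for all of the remaining columns: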

#calculate all pairwise partial correlations, rounded to three decimal places
df.pcorr().round(3)

              currentGrade   hours  examScore
currentGrade         1.000  -0.311      0.736
hours               -0.311   1.000      0.191
examScore            0.736   0.191      1.000

The way to interpret the output is as follows, where each value controls for the one remaining variable:

  • The partial correlation between current grade and hours studied is -0.311.
  • The partial correlation between current grade and exam score is 0.736.
  • The partial correlation between hours studied and exam score is 0.191.
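As a cross-check on the matrix above, the full set of pairwise partial correlations can also be derived from the inverse of the ordinary correlation matrix: each entry is the negated off-diagonal element of the inverse, scaled by the square roots of the two matching diagonal elements. The following is a minimal NumPy sketch of that identity under the assumption that the correlation matrix is invertible; the variable names are illustrative, and it is not presented as pingouin's exact implementation.

```python
import numpy as np

# Columns: currentGrade, hours, examScore (same values as the DataFrame above)
values = np.array([
    [82, 4, 88],
    [88, 3, 85],
    [75, 6, 76],
    [74, 5, 70],
    [93, 4, 92],
    [97, 5, 94],
    [83, 8, 89],
    [90, 7, 85],
    [90, 4, 90],
    [80, 6, 93],
], dtype=float)

# Ordinary Pearson correlation matrix (variables are in columns)
corr = np.corrcoef(values, rowvar=False)

# Invert it, then rescale: pcorr_ij = -P_ij / sqrt(P_ii * P_jj)
precision = np.linalg.inv(corr)
scale = np.sqrt(np.diag(precision))
pcorr_matrix = -precision / np.outer(scale, scale)

# A variable's partial correlation with itself is defined as 1
np.fill_diagonal(pcorr_matrix, 1.0)

print(np.round(pcorr_matrix, 3))
```

The off-diagonal entries should match the values listed above: -0.311, 0.736, and 0.191.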