Point-biserial correlation in Python can be calculated with the scipy.stats.pointbiserialr() function. The function takes two arrays of the same length, x and y: x holds the binary (two-category) values and y holds the continuous values. It returns a correlation coefficient and a p-value, which describe the strength and the statistical significance of the relationship between the two variables.
Point-biserial correlation is used to measure the relationship between a binary variable, x, and a continuous variable, y.
Similar to the Pearson correlation coefficient, the point-biserial correlation coefficient takes on a value between -1 and 1 where:
- -1 indicates a perfectly negative correlation between two variables
- 0 indicates no correlation between two variables
- 1 indicates a perfectly positive correlation between two variables
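The two coefficients share a scale because the point-biserial coefficient is the Pearson correlation computed with one variable coded as 0 and 1. The sketch below illustrates this with a small made-up dataset (the names group and score and their values are hypothetical, chosen only for illustration) by computing both coefficients with scipy.stats and comparing them.

```python
from scipy import stats

# Hypothetical illustrative data: a 0/1 group label and a continuous score
group = [0, 0, 0, 1, 1, 1]
score = [2.0, 3.5, 3.0, 5.0, 6.5, 4.0]

# Tuple unpacking avoids depending on how the installed SciPy version
# names the fields of the result object
r_pb, p_pb = stats.pointbiserialr(group, score)
r_pearson, p_pearson = stats.pearsonr(group, score)

# The two coefficients should agree up to floating-point error
print(r_pb, r_pearson)
```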
This tutorial explains how to calculate the point-biserial correlation between two variables in Python.
Example: Point-Biserial Correlation in Python
Suppose we have a binary variable, x, and a continuous variable, y:
x = [0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
y = [12, 14, 17, 17, 11, 22, 23, 11, 19, 8, 12]
We can use the pointbiserialr() function from the scipy.stats library to calculate the point-biserial correlation between the two variables.
Note that this function returns a correlation coefficient along with a corresponding p-value:
import scipy.stats as stats

#calculate point-biserial correlation
stats.pointbiserialr(x, y)

PointbiserialrResult(correlation=0.21816, pvalue=0.51928)
The point-biserial correlation coefficient is 0.21816 and the corresponding p-value is 0.51928.
Since the correlation coefficient is positive, the variable y tends to take on higher values when the variable x takes on the value “1” than when it takes on the value “0.”
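To see where the sign comes from, the coefficient can be rebuilt from the two group means. The sketch below assumes the common textbook form r = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), where M1 and M0 are the means of y for x = 1 and x = 0, n1 and n0 are the group sizes, and s_n is the population standard deviation of y. It is an illustration of that formula using NumPy, not a description of SciPy's internal code; the variable names are chosen here for readability.

```python
import numpy as np

x = np.array([0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0])
y = np.array([12, 14, 17, 17, 11, 22, 23, 11, 19, 8, 12])

mean_1 = y[x == 1].mean()  # mean of y where x is 1
mean_0 = y[x == 0].mean()  # mean of y where x is 0
n1 = (x == 1).sum()        # number of observations with x = 1
n0 = (x == 0).sum()        # number of observations with x = 0
n = len(y)

# Population standard deviation of y (divides by n, not n - 1)
s_n = y.std(ddof=0)

r_pb = (mean_1 - mean_0) / s_n * np.sqrt(n1 * n0 / n**2)

print(mean_1, mean_0)   # the x = 1 group has the higher mean
print(round(r_pb, 5))   # 0.21816
```

The mean of y is higher in the x = 1 group than in the x = 0 group, which is why the coefficient comes out positive.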
Since the p-value of this correlation is not less than .05, this correlation is not statistically significant.
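A p-value for a correlation coefficient is commonly obtained from a two-sided t-test with n - 2 degrees of freedom, using t = r * sqrt((n - 2) / (1 - r^2)). The sketch below assumes that test and compares its result with the p-value returned by pointbiserialr(); it is a cross-check, not a statement of how SciPy computes the value internally.

```python
import numpy as np
from scipy import stats

x = [0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
y = [12, 14, 17, 17, 11, 22, 23, 11, 19, 8, 12]

r, p_scipy = stats.pointbiserialr(x, y)
n = len(x)

# t statistic for testing whether the correlation differs from zero
t_stat = r * np.sqrt((n - 2) / (1 - r**2))

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(p_manual, p_scipy)
print(p_manual < 0.05)  # False: not significant at the 0.05 level
```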
You can find the exact details of how this correlation is calculated in the scipy.stats documentation.