How to Calculate Correlation Between Continuous & Categorical Variables

When analyzing the relationship between a continuous variable and a categorical variable, a correlation coefficient can describe the strength and direction of that relationship. The most common measure is the Pearson product-moment correlation coefficient, which quantifies the linear relationship between two variables. It is calculated by dividing the covariance of the two variables by the product of their standard deviations. A coefficient of 1 indicates a perfect positive linear relationship, while a coefficient of -1 indicates a perfect negative linear relationship. A coefficient close to 0 indicates little or no linear relationship between the two variables.
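As a rough illustration of that calculation, the sketch below computes Pearson's r as the covariance divided by the product of the standard deviations, then compares it with R's built-in cor() function. The vectors x and y are made-up values chosen only for demonstration; they are not part of the example data used later.

```r
# Illustrative values only (assumed for this sketch, not from the article's data)
x <- c(2, 4, 5, 7, 9)
y <- c(1, 3, 4, 8, 9)

# Pearson's r: covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Base R's built-in Pearson correlation, for comparison
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree up to floating-point rounding, since cor() uses the Pearson method by default.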


When we would like to calculate the correlation between two continuous variables, we typically use the Pearson correlation coefficient.

However, when we would like to calculate the correlation between a continuous variable and a binary categorical variable, we can use something known as point biserial correlation.

Point biserial correlation is used to calculate the correlation between a binary categorical variable (a variable that can only take on two values) and a continuous variable and has the following properties:

  • Point biserial correlation can range between -1 and 1.
  • For each group created by the binary variable, it is assumed that the continuous variable is normally distributed with equal variances.
  • For each group created by the binary variable, it is assumed that there are no extreme outliers.
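One possible way to check these assumptions in R is sketched below. The vectors group and value are hypothetical data invented for this illustration, and the formal tests shown (Shapiro-Wilk, F test for equal variances) are only one option; with small samples, plots such as histograms or boxplots may be more informative.

```r
# Hypothetical data: a 0/1 group indicator and a continuous measurement
group <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
value <- c(12.1, 13.4, 11.8, 14.0, 12.9, 13.3,
           14.2, 15.1, 13.8, 14.9, 15.6, 14.4)

# Normality within each group (Shapiro-Wilk; small p-values suggest non-normality)
shapiro_p <- tapply(value, group, function(v) shapiro.test(v)$p.value)

# Equality of variances between the two groups (F test)
var_p <- var.test(value[group == 0], value[group == 1])$p.value

# Points flagged by the boxplot rule (more than 1.5 * IQR beyond the hinges)
outliers <- tapply(value, group, function(v) boxplot.stats(v)$out)

print(shapiro_p)
print(var_p)
print(outliers)
```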

The following example shows how to calculate a point biserial correlation in practice.

Example: Calculating a Point Biserial Correlation

Suppose a college professor would like to determine whether there is a correlation between gender and score on a particular aptitude exam.

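He collects the following exam scores from 12 males and 12 females in his class:

Females: 77, 78, 79, 79, 82, 84, 85, 88, 89, 91, 91, 94
Males: 84, 84, 84, 85, 85, 86, 86, 86, 89, 91, 94, 98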

Since gender is a categorical variable and score is a continuous variable, it makes sense to calculate a point-biserial correlation between the two variables.

The professor can use any statistical software (including Excel, R, Python, SPSS, Stata) to calculate the point-biserial correlation between the two variables.

The following code shows how to calculate the point-biserial correlation in R, using the value 0 to represent females and 1 to represent males for the gender variable:

#define values for gender
gender <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

#define values for score
score <- c(77, 78, 79, 79, 82, 84, 85, 88, 89, 91, 91, 94,
           84, 84, 84, 85, 85, 86, 86, 86, 89, 91, 94, 98)

#calculate point-biserial correlation
cor.test(gender, score)

	Pearson's product-moment correlation

data:  gender and score
t = 1.3739, df = 22, p-value = 0.1833
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.1379386  0.6147832
sample estimates:
      cor 
0.2810996 

From the output we can see that the point biserial correlation coefficient is 0.281 and the corresponding p-value is 0.1833.

Since the correlation coefficient is positive, it tells us that there is a positive correlation between gender and score.

Since we coded males as 1 and females as 0, this indicates that scores tend to be higher for males (i.e., scores tend to increase as gender “increases” from 0 to 1).

However, since the p-value (0.1833) is not less than 0.05, this correlation coefficient is not statistically significant.
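To see where the coefficient comes from, the sketch below recomputes it by hand from the two group means, using the same data as the example. It also runs a pooled-variance two-sample t-test, which is mathematically equivalent to the significance test for a point-biserial correlation. The variable names (n0, n1, m0, m1, r_pb, tt) are introduced here only for illustration.

```r
# Data from the example above (0 = female, 1 = male)
gender <- c(rep(0, 12), rep(1, 12))
score <- c(77, 78, 79, 79, 82, 84, 85, 88, 89, 91, 91, 94,
           84, 84, 84, 85, 85, 86, 86, 86, 89, 91, 94, 98)

n  <- length(score)
n0 <- sum(gender == 0)
n1 <- sum(gender == 1)
m0 <- mean(score[gender == 0])
m1 <- mean(score[gender == 1])

# sd() divides by n - 1, so the matching weight is n1 * n0 / (n * (n - 1))
r_pb <- (m1 - m0) / sd(score) * sqrt(n1 * n0 / (n * (n - 1)))

# Pooled-variance two-sample t-test on the same data
tt <- t.test(score[gender == 1], score[gender == 0], var.equal = TRUE)

print(r_pb)                # hand-computed point-biserial correlation
print(cor(gender, score))  # Pearson correlation on the 0/1 coding
print(tt$statistic)        # t statistic from the t-test
print(tt$p.value)          # p-value from the t-test
```

The hand-computed value should match the coefficient reported by cor.test(), and the t statistic and p-value from the t-test should match those in the cor.test() output.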
