How to Calculate Correlation By Group in Pandas

Pandas provides a method to calculate the correlation between different groups of a data set. This method is known as ‘groupby’ and it allows us to group and calculate the correlation between different variables in the same data set. To use this method, first select the columns to be grouped and then use the ‘groupby’ method to group them. Finally, use the ‘corr’ method to calculate the correlation between the different groups. This method is useful for identifying the relationships between different data sets and can be used for both numerical and categorical data.


You can use the following basic syntax to calculate the correlation between two variables by group in pandas:

df.groupby('group_var')[['values1','values2']].corr().unstack().iloc[:,1]

The following example shows how to use this syntax in practice.

Example: Calculate Correlation By Group in Pandas

Suppose we have the following pandas DataFrame:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28],
                   'assists': [2, 7, 9, 3, 12, 10, 14, 21]})

#view DataFrame
print(df)

We can use the following code to calculate the correlation between points and assists, grouped by team:

#calculate correlation between points and assists, grouped by team
df.groupby('team')[['points','assists']].corr().unstack().iloc[:,1]

team
A    0.603053
B    0.981798
Name: (points, assists), dtype: float64

From the output we can see:

  • The correlation coefficient between points and assists for team A is .603053.
  • The correlation coefficient between points and assists for team B is .981798.

Since both correlation coefficients are positive, this tells us that the relationship between points and assists for both teams is positive.

That is, players who tend to score more points also tend to record more assists.

Related: 

Note that we could shorten the syntax by not using the unstack and iloc functions, but the results are uglier:

df.groupby('team')[['points','assists']].corr()

		points	  assists
team			
A	points	1.000000  0.603053
        assists	0.603053  1.000000
B	points	1.000000  0.981798
        assists	0.981798  1.000000

This syntax produces a correlation matrix for both teams, which provides us with excessive information.

x