
To calculate the p-value for a correlation coefficient in pandas, you can use the .corr() method to compute the correlation coefficient itself, but pandas has no built-in method that returns the p-value. For that, you can use the pearsonr() function from SciPy, which returns both the coefficient and its p-value. The p-value of a correlation coefficient is the probability of obtaining a correlation at least as extreme as the one observed if the null hypothesis of no correlation is true. The lower the p-value, the stronger the evidence against the null hypothesis. Note that a low p-value does not by itself mean the correlation is strong, only that it is unlikely to be zero.

The Pearson correlation coefficient can be used to measure the linear association between two variables.

This correlation coefficient always takes on a value between **-1** and **1** where:

- **-1**: Perfectly negative linear correlation between two variables.
- **0**: No linear correlation between two variables.
- **1**: Perfectly positive linear correlation between two variables.
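As a quick illustration of these three cases, here is a minimal sketch using NumPy's corrcoef() function on a small made-up array. The values and the names `r_pos`, `r_neg`, and `r_zero` are chosen only for this illustration.

```python
import numpy as np

# made-up values, symmetric around zero
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# y rises linearly with x, so the coefficient is (numerically) 1
r_pos = np.corrcoef(x, 3 * x + 1)[0, 1]

# y falls linearly with x, so the coefficient is (numerically) -1
r_neg = np.corrcoef(x, -2 * x + 5)[0, 1]

# y = x^2 is a symmetric curve with no linear trend, so the coefficient is 0
r_zero = np.corrcoef(x, x**2)[0, 1]

print(r_pos, r_neg, r_zero)
```

The last case shows that a coefficient of 0 only rules out a linear relationship: here y depends entirely on x, just not linearly.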

To determine if a correlation coefficient is statistically significant, you can calculate the corresponding t-score and p-value.

The formula to calculate the t-score of a correlation coefficient (r) is:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}$$

The p-value is calculated as the corresponding two-sided p-value for the t-distribution with n-2 degrees of freedom.
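To make the formula concrete, the following sketch computes the t-score and two-sided p-value by hand and compares them with the result from SciPy's pearsonr() function. The two arrays are small sample values used only for illustration, and the variable names are our own.

```python
import numpy as np
from scipy import stats
from scipy.stats import pearsonr

# sample values for illustration
x = np.array([4, 5, 5, 7, 10, 12, 13, 14])
y = np.array([10, 12, 14, 18, 19, 13, 20, 14])

n = len(x)

# Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]

# t-score of the correlation coefficient
t_score = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# two-sided p-value from the t-distribution with n-2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_score), df=n - 2)

# compare with SciPy's built-in calculation
r_check, p_check = pearsonr(x, y)

print(r, t_score, p_value)
print(r_check, p_check)
```

The manual p-value and the one returned by pearsonr() should agree up to floating-point error.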

To calculate the p-value for a Pearson correlation coefficient in pandas, you can use the **pearsonr()** function from the **SciPy** library:

**from scipy.stats import pearsonr
pearsonr(df['column1'], df['column2'])
**

This function will return the Pearson correlation coefficient between columns **column1** and **column2** along with the corresponding p-value that tells us whether or not the correlation coefficient is statistically significant.

If you would like to calculate the p-value for the Pearson correlation coefficient of each possible pairwise combination of columns in a DataFrame, you can use the following custom function to do so:

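**import pandas as pd
from scipy.stats import pearsonr
def r_pvalues(df):
    #build an empty square DataFrame with df's columns as both rows and columns
    cols = pd.DataFrame(columns=df.columns)
    p = cols.transpose().join(cols, how='outer')
    for r in df.columns:
        for c in df.columns:
            #keep only the rows where both columns are non-missing
            tmp = df[df[r].notnull() & df[c].notnull()]
            #use .loc since chained assignment does not update p under copy-on-write
            p.loc[c, r] = round(pearsonr(tmp[r], tmp[c])[1], 4)
    return p
**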

The following examples show how to calculate p-values for correlation coefficients in practice with the following pandas DataFrame:

**import numpy as np
import pandas as pd
#create DataFrame
df = pd.DataFrame({'x': [4, 5, 5, 7, 8, 10, 12, 13, 14, 15],
                   'y': [10, 12, 14, 18, np.nan, 19, 13, 20, 14, np.nan],
                   'z': [20, 24, 24, 23, 19, 15, 18, 14, 10, 12]})
#view DataFrame
print(df)
    x     y   z
0   4  10.0  20
1   5  12.0  24
2   5  14.0  24
3   7  18.0  23
4   8   NaN  19
5  10  19.0  15
6  12  13.0  18
7  13  20.0  14
8  14  14.0  10
9  15   NaN  12
**

**Example 1: Calculate P-Value for Correlation Coefficient Between Two Columns in Pandas**

The following code shows how to calculate the Pearson correlation coefficient and corresponding p-value for the **x** and **y** columns in the DataFrame:

**from scipy.stats import pearsonr
#drop all rows with NaN values
df_new = df.dropna()
#calculate correlation coefficient and p-value between x and y
pearsonr(df_new['x'], df_new['y'])
PearsonRResult(statistic=0.4791621985883838, pvalue=0.22961622926360523)
**

- The Pearson correlation coefficient is **0.4792**.
- The corresponding p-value is **0.2296**.

Since the correlation coefficient is positive, it indicates that there is a positive linear relationship between the two variables.

However, since the p-value of the correlation coefficient (0.2296) is not less than 0.05, the correlation is not statistically significant at the 0.05 level.

Note that we can also use the following syntax to extract the p-value for the correlation coefficient:

**#extract p-value of correlation coefficient
pearsonr(df_new['x'], df_new['y'])[1]
0.22961622926360523
**

The p-value for the correlation coefficient is **0.2296**.

This matches the p-value from the previous output.
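As an alternative to indexing with **[1]**, the result of **pearsonr()** can be unpacked into two variables, which makes the significance check easier to read. The sketch below rebuilds the same DataFrame so that it runs on its own. The 0.05 threshold and the names `alpha` and `significant` are our own choices for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# same DataFrame as in the example above
df = pd.DataFrame({'x': [4, 5, 5, 7, 8, 10, 12, 13, 14, 15],
                   'y': [10, 12, 14, 18, np.nan, 19, 13, 20, 14, np.nan],
                   'z': [20, 24, 24, 23, 19, 15, 18, 14, 10, 12]})

# drop all rows with NaN values
df_new = df.dropna()

# unpack the correlation coefficient and the p-value
r, p = pearsonr(df_new['x'], df_new['y'])

# significance level (an assumption; pick the one that suits your analysis)
alpha = 0.05
significant = p < alpha

print(f"r = {r:.4f}, p-value = {p:.4f}, significant: {significant}")
```

Unpacking works whether **pearsonr()** returns a plain tuple or a result object such as the **PearsonRResult** shown in the output above.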

**Example 2: Calculate P-Value for Correlation Coefficient Between All Columns in Pandas**

The following code shows how to calculate the Pearson correlation coefficient and corresponding p-value for each pairwise combination of columns in the pandas DataFrame:

**#create function to calculate p-values for each pairwise correlation coefficient
def r_pvalues(df):
    #build an empty square DataFrame with df's columns as both rows and columns
    cols = pd.DataFrame(columns=df.columns)
    p = cols.transpose().join(cols, how='outer')
    for r in df.columns:
        for c in df.columns:
            #keep only the rows where both columns are non-missing
            tmp = df[df[r].notnull() & df[c].notnull()]
            #use .loc since chained assignment does not update p under copy-on-write
            p.loc[c, r] = round(pearsonr(tmp[r], tmp[c])[1], 4)
    return p
#use custom function to calculate p-values
r_pvalues(df)
        x       y       z
x     0.0  0.2296  0.0005
y  0.2296     0.0  0.4238
z  0.0005  0.4238     0.0
**

From the output we can see:

- The p-value for the correlation coefficient between x and y is **0.2296**.
- The p-value for the correlation coefficient between x and z is **0.0005**.
- The p-value for the correlation coefficient between y and z is **0.4238**.

Note that we rounded the p-values to four decimal places in our custom function.

Feel free to change the **4** in the **round()** call inside the function to a different number to round to a different number of decimal places.
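A more compact alternative to the custom function is to pass a callable to the pandas **.corr()** method, which applies it to every pair of columns after excluding missing values for that pair. The sketch below is one possible approach, not the only one. The name `pvals` is our own, and the DataFrame is rebuilt so that the code runs on its own.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# same DataFrame as in the examples above
df = pd.DataFrame({'x': [4, 5, 5, 7, 8, 10, 12, 13, 14, 15],
                   'y': [10, 12, 14, 18, np.nan, 19, 13, 20, 14, np.nan],
                   'z': [20, 24, 24, 23, 19, 15, 18, 14, 10, 12]})

# .corr() accepts a callable that takes two 1-D arrays and returns a float;
# here the callable returns the p-value instead of the coefficient
pvals = df.corr(method=lambda a, b: pearsonr(a, b)[1]).round(4)

# pandas sets the diagonal to 1 for any callable, which would be misleading
# for p-values, so replace the diagonal with NaN
pvals = pvals.mask(np.eye(len(pvals), dtype=bool))

print(pvals)
```

The off-diagonal values should match those produced by the custom function. The diagonal is masked because pandas fills it with 1 regardless of what the callable returns.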

**Note**: You can find the complete documentation for the **pearsonr()** function in the official SciPy documentation.