How can I calculate the correlation between two columns in PySpark?

To calculate the correlation between two columns in PySpark, you can use the corr() function from the pyspark.sql.functions module or the stat.corr() method of a DataFrame. Both take two columns and compute the Pearson correlation coefficient between them, giving a value between -1 and 1. A value of 1 indicates a perfect positive linear relationship, 0 indicates no linear relationship, and -1 indicates a perfect negative linear relationship. This makes the coefficient a quick way to check how strongly two variables in a dataset move together.

PySpark: Calculate Correlation Between Two Columns


The correlation coefficient helps us quantify the strength and direction of the linear relationship between two variables.

To calculate the correlation coefficient between two columns in a PySpark DataFrame, you can use the following syntax:

df.stat.corr('column1', 'column2')

This code returns a value between -1 and 1 that represents the Pearson correlation coefficient between column1 and column2.

The following example shows how to use this syntax in practice.

Example: Calculate Correlation Between Two Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about assists, rebounds and points for various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[4, 12, 22], 
        [5, 14, 24], 
        [5, 13, 26], 
        [6, 7, 26], 
        [7, 8, 29],
        [8, 8, 32],
        [8, 9, 20],
        [10, 13, 14]]
  
#define column names
columns = ['assists', 'rebounds', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+--------+------+
|assists|rebounds|points|
+-------+--------+------+
|      4|      12|    22|
|      5|      14|    24|
|      5|      13|    26|
|      6|       7|    26|
|      7|       8|    29|
|      8|       8|    32|
|      8|       9|    20|
|     10|      13|    14|
+-------+--------+------+

We can use the following syntax to calculate the correlation between the assists and points columns in the DataFrame:

#calculate correlation between assists and points columns
df.stat.corr('assists', 'points')

-0.32957304910500873

The correlation coefficient turns out to be -0.32957.

Since this value is negative, it tells us that there is a negative linear association between the two variables, although with a magnitude of about 0.33 the association is fairly weak.

In other words, when the value for assists increases, the value for points tends to decrease.

And when the value for assists decreases, the value for points tends to increase.
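To see where this number comes from, the sketch below recomputes the Pearson correlation coefficient in plain Python from the same assists and points values. The helper name pearson_r is hypothetical and is not part of PySpark. It is only meant to illustrate the formula, not how Spark computes the value internally.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# same values as the assists and points columns in the DataFrame
assists = [4, 5, 5, 6, 7, 8, 8, 10]
points = [22, 24, 26, 26, 29, 32, 20, 14]

r = pearson_r(assists, points)
print(round(r, 5))  # -0.32957
```

The result matches the value returned by df.stat.corr() up to floating-point rounding.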

To calculate the correlation coefficient between two different columns, simply replace assists and points with the column names of your choice.
