How can I use PySpark to calculate the correlation between two columns?

PySpark is the Python API for Apache Spark, an open-source framework for processing and analyzing large datasets. To calculate the correlation between two columns in PySpark, we can use the corr() function, which takes two column names as parameters and returns their correlation coefficient. This helps us understand how the two columns are related. Because Spark distributes the computation across a cluster, the same call also works on datasets that are too large to process on a single machine.


The Pearson correlation coefficient helps us quantify the strength and direction of the linear relationship between two variables.

To calculate the correlation coefficient between two columns in a PySpark DataFrame, you can use the following syntax:

df.stat.corr('column1', 'column2')

This returns a value between -1 and 1 that represents the Pearson correlation coefficient between column1 and column2. Both columns must be numeric.
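
For reference, the Pearson correlation coefficient of n paired values is conventionally defined as shown below. The symbols are notation introduced here for illustration: x_i and y_i are the values in the two columns, and \bar{x} and \bar{y} are the column means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.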

The following example shows how to use this syntax in practice.

Example: Calculate Correlation Between Two Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about assists, rebounds and points for various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[4, 12, 22], 
        [5, 14, 24], 
        [5, 13, 26], 
        [6, 7, 26], 
        [7, 8, 29],
        [8, 8, 32],
        [8, 9, 20],
        [10, 13, 14]]
  
#define column names
columns = ['assists', 'rebounds', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+--------+------+
|assists|rebounds|points|
+-------+--------+------+
|      4|      12|    22|
|      5|      14|    24|
|      5|      13|    26|
|      6|       7|    26|
|      7|       8|    29|
|      8|       8|    32|
|      8|       9|    20|
|     10|      13|    14|
+-------+--------+------+

We can use the following syntax to calculate the correlation between the assists and points columns in the DataFrame:

#calculate correlation between assists and points columns
df.stat.corr('assists', 'points')

-0.32957304910500873

The correlation coefficient turns out to be -0.32957.

Since this value is negative, it tells us that there is a negative linear association between the two variables. Since it is also fairly close to zero, that association is weak.

In other words, when the value for assists increases, the value for points tends to decrease.

And when the value for assists decreases, the value for points tends to increase.
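
As a sanity check, the same coefficient can be reproduced without Spark. The sketch below is plain Python, not PySpark's own implementation: the helper name pearson is hypothetical, and the two lists simply repeat the assists and points values from the example DataFrame.

```python
from math import sqrt

# Same values as the assists and points columns in the example DataFrame
assists = [4, 5, 5, 6, 7, 8, 8, 10]
points = [22, 24, 26, 26, 29, 32, 20, 14]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of the products of each pair's deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


r = pearson(assists, points)
print(round(r, 5))  # -0.32957
```

The result agrees with the value returned by df.stat.corr, which suggests the function is computing the ordinary Pearson coefficient on these two columns.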

To calculate the correlation between a different pair of columns, replace assists and points with the names of any two numeric columns in your DataFrame.
