How do I create a correlation matrix in PySpark?


A correlation matrix is a square table that shows the correlation coefficients between variables in a dataset.

It offers a quick way to understand the strength of the linear relationships that exist between variables in a dataset.

You can use the following syntax to create a correlation matrix from a PySpark DataFrame:

from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

#combine the DataFrame columns into a single vector column
vector_col = 'corr_features'
assembler = VectorAssembler(inputCols=df.columns, outputCol=vector_col)
df_vector = assembler.transform(df).select(vector_col)

#calculate correlation matrix
corr_matrix = Correlation.corr(df_vector, vector_col)

#display correlation matrix
corr_matrix.collect()[0]['pearson({})'.format(vector_col)].values

This particular code first uses VectorAssembler to combine the DataFrame columns into a single vector column, then uses Correlation.corr from pyspark.ml.stat to calculate the correlation matrix for that vector column.
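
By default, Correlation.corr computes Pearson correlation coefficients. The function also accepts an optional method argument, so as a minimal sketch you could request Spearman rank correlations for the same vector column instead (the result column name should then change accordingly):

#calculate Spearman (rank-based) correlation matrix instead of Pearson
spearman_matrix = Correlation.corr(df_vector, vector_col, 'spearman')

#display Spearman correlation matrix
spearman_matrix.collect()[0]['spearman({})'.format(vector_col)].values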

The following example shows how to use this syntax in practice.

Example: How to Create a Correlation Matrix in PySpark

Suppose we have the following PySpark DataFrame that contains information about assists, rebounds and points for various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[4, 12, 22], 
        [5, 14, 24], 
        [5, 13, 26], 
        [6, 7, 26], 
        [7, 8, 29],
        [8, 8, 32],
        [8, 9, 20],
        [10, 13, 14]]
  
#define column names
columns = ['assists', 'rebounds', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+--------+------+
|assists|rebounds|points|
+-------+--------+------+
|      4|      12|    22|
|      5|      14|    24|
|      5|      13|    26|
|      6|       7|    26|
|      7|       8|    29|
|      8|       8|    32|
|      8|       9|    20|
|     10|      13|    14|
+-------+--------+------+

We can use the following syntax to create a correlation matrix for this DataFrame:

from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

#combine the DataFrame columns into a single vector column
vector_col = 'corr_features'
assembler = VectorAssembler(inputCols=df.columns, outputCol=vector_col)
df_vector = assembler.transform(df).select(vector_col)

#calculate correlation matrix
corr_matrix = Correlation.corr(df_vector, vector_col)

#display correlation matrix
corr_matrix.collect()[0]['pearson({})'.format(vector_col)].values

array([ 1.        , -0.24486081, -0.32957305, -0.24486081,  1.        ,
       -0.52209174, -0.32957305, -0.52209174,  1.        ])
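
Since values returns the matrix as a flat array, it can be easier to read after reshaping it into a labeled table. The following is a minimal sketch (assuming pandas is available on the driver) that calls toArray() on the returned matrix and labels the rows and columns with the original DataFrame column names:

import pandas as pd

#extract the correlation matrix as a 2D array instead of a flat array
corr_array = corr_matrix.collect()[0]['pearson({})'.format(vector_col)].toArray()

#label rows and columns with the original column names
corr_df = pd.DataFrame(corr_array, index=df.columns, columns=df.columns)
print(corr_df)

Each row and column corresponds to assists, rebounds and points in the same order as the inputCols passed to VectorAssembler, so the off-diagonal entries match the pairwise coefficients listed below.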

The correlation coefficients along the diagonal of the table are all equal to 1 because each variable is perfectly correlated with itself.

All of the other correlation coefficients indicate the correlation between different pairwise combinations of variables.

For example:

  • The correlation coefficient between assists and rebounds is -0.245.
  • The correlation coefficient between assists and points is -0.330.
  • The correlation coefficient between rebounds and points is -0.522.
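
If you only need the coefficient for a single pair of columns, you can skip VectorAssembler entirely. As a quick sketch, the DataFrame method df.stat.corr (which computes a Pearson correlation by default) returns one coefficient directly:

#correlation between a single pair of columns (Pearson)
df.stat.corr('assists', 'rebounds')

This should return roughly -0.245, matching the assists and rebounds entry in the matrix above.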
