How can a correlation matrix be created in PySpark?

A correlation matrix in PySpark can be created without leaving Spark by using the built-in MLlib statistics functions. First, the numerical columns of a Spark DataFrame are combined into a single vector column using VectorAssembler from pyspark.ml.feature. Then the Correlation.corr function from pyspark.ml.stat calculates the pairwise Pearson correlation coefficients between all of those columns and returns the result as a correlation matrix. This approach scales to large datasets and can aid in identifying linear relationships between variables in a dataset.

Create a Correlation Matrix in PySpark


A correlation matrix is a square table that shows the correlation coefficients between variables in a dataset.

It offers a quick way to understand the strength of the linear relationships that exist between variables in a dataset.

You can use the following syntax to create a correlation matrix from a PySpark DataFrame:

from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

#combine all DataFrame columns into a single vector column
vector_col = 'corr_features'
assembler = VectorAssembler(inputCols=df.columns, outputCol=vector_col)
df_vector = assembler.transform(df).select(vector_col)

#calculate correlation matrix
corr_matrix = Correlation.corr(df_vector, vector_col)

#display correlation matrix
corr_matrix.collect()[0]['pearson({})'.format(vector_col)].values

This particular code uses VectorAssembler to first combine the DataFrame columns into a single vector column, then uses the Correlation class from pyspark.ml.stat to calculate the Pearson correlation matrix.
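
Note that Correlation.corr calculates Pearson correlations by default. If you suspect outliers or monotonic but non-linear relationships, you can pass the optional method argument to calculate rank-based Spearman correlations instead. A minimal sketch:

#calculate Spearman (rank-based) correlation matrix instead of Pearson
corr_matrix_spearman = Correlation.corr(df_vector, vector_col, 'spearman')

#display Spearman correlation matrix
corr_matrix_spearman.collect()[0]['spearman({})'.format(vector_col)].values

The result column is named after the method used, so 'spearman({})' replaces 'pearson({})' when extracting the values.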

The following example shows how to use this syntax in practice.

Example: How to Create a Correlation Matrix in PySpark

Suppose we have the following PySpark DataFrame that contains information about assists, rebounds and points for various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[4, 12, 22], 
        [5, 14, 24], 
        [5, 13, 26], 
        [6, 7, 26], 
        [7, 8, 29],
        [8, 8, 32],
        [8, 9, 20],
        [10, 13, 14]]
  
#define column names
columns = ['assists', 'rebounds', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+--------+------+
|assists|rebounds|points|
+-------+--------+------+
|      4|      12|    22|
|      5|      14|    24|
|      5|      13|    26|
|      6|       7|    26|
|      7|       8|    29|
|      8|       8|    32|
|      8|       9|    20|
|     10|      13|    14|
+-------+--------+------+

We can use the following syntax to create a correlation matrix for this DataFrame:

from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

#combine all DataFrame columns into a single vector column
vector_col = 'corr_features'
assembler = VectorAssembler(inputCols=df.columns, outputCol=vector_col)
df_vector = assembler.transform(df).select(vector_col)

#calculate correlation matrix
corr_matrix = Correlation.corr(df_vector, vector_col)

#display correlation matrix
corr_matrix.collect()[0]['pearson({})'.format(vector_col)].values

array([ 1.        , -0.24486081, -0.32957305, -0.24486081,  1.        ,
       -0.52209174, -0.32957305, -0.52209174,  1.        ])

The correlation coefficients along the diagonal of the table are all equal to 1 because each variable is perfectly correlated with itself.

All of the other correlation coefficients indicate the correlation between different pairwise combinations of variables.

For example:

  • The correlation coefficient between assists and rebounds is -0.245.
  • The correlation coefficient between assists and points is -0.330.
  • The correlation coefficient between rebounds and points is -0.522.
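
The output above is the row-major layout of a 3x3 matrix. As a minimal sketch (assuming pandas is installed alongside PySpark), you can convert the result into a labeled table using the column names defined earlier:

import pandas as pd

#extract the DenseMatrix from the result and convert it to a 2D NumPy array
matrix = corr_matrix.collect()[0]['pearson({})'.format(vector_col)].toArray()

#wrap the array in a pandas DataFrame labeled with the original column names
corr_df = pd.DataFrame(matrix, index=columns, columns=columns)
print(corr_df)

This produces a labeled table with 1s along the diagonal and the pairwise coefficients listed above in the off-diagonal cells.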
