How can I calculate the mean of multiple columns in PySpark?

Calculating the mean of multiple columns in PySpark can mean one of two things: computing a single average per column (aggregating down the rows), or computing a per-row average across several numerical columns. Per-column means are typically computed with the "agg" function and the "mean" method, which returns a new DataFrame containing one mean value per specified column. This tutorial focuses on the second case: building a row-wise mean across multiple columns, which provides a quick and efficient way to summarize large datasets in PySpark.
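For reference, here is a minimal sketch of the per-column approach with agg, assuming a DataFrame df with numeric columns game1, game2, and game3 (the same columns used in the example below):

from pyspark.sql import functions as F

#compute one mean per column, aggregated across all rows
df.agg(F.mean('game1'), F.mean('game2'), F.mean('game3')).show()

The rest of this tutorial covers the row-wise case.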

Calculate Mean of Multiple Columns in PySpark


You can use the following syntax to calculate the mean value across multiple columns in a PySpark DataFrame:

from pyspark.sql import functions as F

#define columns to calculate mean for
mean_cols = ['game1','game2','game3']

#define column expression to calculate row-wise mean
find_mean = F.expr('+'.join(mean_cols)) / len(mean_cols)

#calculate mean across specific columns
df_new = df.withColumn('mean', find_mean)

This particular example creates a new column called mean that contains the row-wise mean of the values in the game1, game2, and game3 columns of the DataFrame. Here, '+'.join(mean_cols) builds the SQL expression string game1+game2+game3, F.expr turns that string into a column that sums the three values, and dividing by len(mean_cols) converts the sum into a mean.
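As a side note, an equivalent expression can be built without string manipulation by summing Column objects directly. This is a minimal sketch of that alternative, not part of the syntax above:

from pyspark.sql import functions as F

#build the same row-wise mean by summing Column objects
mean_cols = ['game1', 'game2', 'game3']
find_mean = sum(F.col(c) for c in mean_cols) / len(mean_cols)

Both versions produce the same column expression, so the choice between them is purely a matter of style.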

The following example shows how to use this syntax in practice.

Example: How to Calculate Mean of Multiple Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players during three different games:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 25, 11, 10], 
        ['Nets', 22, 8, 14], 
        ['Hawks', 14, 22, 10], 
        ['Kings', 30, 22, 35], 
        ['Bulls', 15, 14, 12], 
        ['Blazers', 10, 14, 18]] 
  
#define column names
columns = ['team', 'game1', 'game2', 'game3'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+-----+-----+-----+
|   team|game1|game2|game3|
+-------+-----+-----+-----+
|   Mavs|   25|   11|   10|
|   Nets|   22|    8|   14|
|  Hawks|   14|   22|   10|
|  Kings|   30|   22|   35|
|  Bulls|   15|   14|   12|
|Blazers|   10|   14|   18|
+-------+-----+-----+-----+

Suppose we would like to add a new column called mean that contains the mean of points scored by each player across all three games.

We can use the following syntax to do so:

from pyspark.sql import functions as F

#define columns to calculate mean for
mean_cols = ['game1','game2','game3']

#define column expression to calculate row-wise mean
find_mean = F.expr('+'.join(mean_cols)) / len(mean_cols)

#calculate mean across specific columns
df_new = df.withColumn('mean', find_mean)

#view new DataFrame
df_new.show()

+-------+-----+-----+-----+------------------+
|   team|game1|game2|game3|              mean|
+-------+-----+-----+-----+------------------+
|   Mavs|   25|   11|   10|15.333333333333334|
|   Nets|   22|    8|   14|14.666666666666666|
|  Hawks|   14|   22|   10|15.333333333333334|
|  Kings|   30|   22|   35|              29.0|
|  Bulls|   15|   14|   12|13.666666666666666|
|Blazers|   10|   14|   18|              14.0|
+-------+-----+-----+-----+------------------+

Notice that the new mean column contains the mean of values across the game1, game2 and game3 columns.

For example:

  • The mean of points for the Mavs player is (25 + 11 + 10) / 3 = 15.33
  • The mean of points for the Nets player is (22 + 8 + 14) / 3 = 14.67
  • The mean of points for the Hawks player is (14 + 22 + 10) / 3 = 15.33

And so on.
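If you would rather display fewer decimal places, you can wrap the expression in the round function before adding the column. This is an optional tweak, not part of the original example:

from pyspark.sql import functions as F

#round the row-wise mean to two decimal places
df_new = df.withColumn('mean', F.round(find_mean, 2))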

Note that we used the withColumn function to return a new DataFrame with the mean column added and all other columns left the same.
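One caveat worth noting: because the expression adds the columns directly, the resulting mean is null for any row in which one of the columns is null. If you would rather treat missing values as zero, one workaround (a sketch, assuming that interpretation fits your data) is to wrap each column in coalesce before summing:

from pyspark.sql import functions as F

#replace null values with 0 before computing the row-wise mean
find_mean_safe = sum(F.coalesce(F.col(c), F.lit(0)) for c in mean_cols) / len(mean_cols)

df_new = df.withColumn('mean', find_mean_safe)

Note that this still divides by the total number of columns, so a null counts as a score of 0 rather than being excluded from the average.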

You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.
