How can I calculate the mean of a column in PySpark?

You can calculate the mean of a column in PySpark with the agg method of the DataFrame API. It accepts either an aggregate expression, such as the mean function from pyspark.sql.functions applied to the column, or a dictionary that maps the column name to the aggregation name “mean”, and it returns the average value of the column. Alternatively, the describe method returns the mean along with other descriptive statistics for the column (count, standard deviation, minimum and maximum).

Calculate the Mean of a Column in PySpark


You can use the following methods to calculate the mean of a column in a PySpark DataFrame:

Method 1: Calculate Mean for One Specific Column

from pyspark.sql import functions as F

#calculate mean of column named 'game1'
df.agg(F.mean('game1')).collect()[0][0]

Method 2: Calculate Mean for Multiple Columns

from pyspark.sql.functions import mean

#calculate mean for game1, game2 and game3 columns
df.select(mean(df.game1), mean(df.game2), mean(df.game3)).show()

The examples below show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 25, 11, 10], 
        ['Nets', 22, 8, 14], 
        ['Hawks', 14, 22, 10], 
        ['Kings', 30, 22, 35], 
        ['Bulls', 15, 14, 12], 
        ['Blazers', 10, 14, 18]] 
  
#define column names
columns = ['team', 'game1', 'game2', 'game3'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+-----+-----+-----+
|   team|game1|game2|game3|
+-------+-----+-----+-----+
|   Mavs|   25|   11|   10|
|   Nets|   22|    8|   14|
|  Hawks|   14|   22|   10|
|  Kings|   30|   22|   35|
|  Bulls|   15|   14|   12|
|Blazers|   10|   14|   18|
+-------+-----+-----+-----+

Example 1: Calculate Mean for One Specific Column

We can use the following syntax to calculate the mean of values in the game1 column of the DataFrame only:

from pyspark.sql import functions as F

#calculate mean of column named 'game1'
df.agg(F.mean('game1')).collect()[0][0]

19.333333333333332

The mean of values in the game1 column turns out to be 19.333.

We can verify this is correct by manually calculating the mean of values in this column:

Mean of values in game1: (25 + 22 + 14 + 30 + 15 + 10) / 6 = 19.333.
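The same check can be done in plain Python, without Spark. This is only a sketch that copies the six game1 values by hand from the DataFrame above; the names game1 and manual_mean are illustrative:

```python
# values of the game1 column, copied from the DataFrame above
game1 = [25, 22, 14, 30, 15, 10]

# mean = sum of the values divided by the number of values
manual_mean = sum(game1) / len(game1)
print(manual_mean)
```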

Example 2: Calculate Mean for Multiple Columns

We can use the following syntax to calculate the mean of values for the game1, game2 and game3 columns of the DataFrame:

from pyspark.sql.functions import mean

#calculate mean for game1, game2 and game3 columns
df.select(mean(df.game1), mean(df.game2), mean(df.game3)).show()

+------------------+------------------+----------+
|        avg(game1)|        avg(game2)|avg(game3)|
+------------------+------------------+----------+
|19.333333333333332|15.166666666666666|      16.5|
+------------------+------------------+----------+

From the output we can see:

  • The mean of values in the game1 column is 19.333.
  • The mean of values in the game2 column is 15.167.
  • The mean of values in the game3 column is 16.5.

Note: If there are null values in the column, the mean function will ignore these values by default.
