How can I calculate the conditional mean in PySpark?

A conditional mean is the average of one column computed only over the rows of a DataFrame that satisfy some condition. In PySpark you can calculate it by filtering the DataFrame with filter and then aggregating with agg and mean, or by using groupBy when you want the mean within every group at once. The result summarizes how one variable behaves within a subset of the data, which is useful in exploratory analysis, reporting, and feature engineering.

Calculate Conditional Mean in PySpark


You can use the following methods to calculate a conditional mean in a PySpark DataFrame:

Method 1: Calculate Conditional Mean for String Variable

from pyspark.sql import functions as F

df.filter(df.team == 'A').agg(F.mean('points').alias('mean_points')).show()

This particular example calculates the mean value of the points column only for the rows where the value in the team column is equal to A.

Method 2: Calculate Conditional Mean for Numeric Variable

from pyspark.sql import functions as F

df.filter(df.points > 10).agg(F.mean('assists').alias('mean_assists')).show()

This particular example calculates the mean value of the assists column only for the rows where the value in the points column is greater than 10.

The following examples show how to use each method in practice with the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 4], 
        ['A', 'Guard', 8, 5], 
        ['A', 'Forward', 22, 5], 
        ['A', 'Forward', 22, 9], 
        ['B', 'Guard', 14, 12], 
        ['B', 'Guard', 14, 3],
        ['B', 'Forward', 13, 5],
        ['B', 'Forward', 7, 2]] 
  
#define column names
columns = ['team', 'position', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      4|
|   A|   Guard|     8|      5|
|   A| Forward|    22|      5|
|   A| Forward|    22|      9|
|   B|   Guard|    14|     12|
|   B|   Guard|    14|      3|
|   B| Forward|    13|      5|
|   B| Forward|     7|      2|
+----+--------+------+-------+

Example 1: Calculate Conditional Mean for String Variable

We can use the following syntax to calculate the mean value in the points column only for the rows where the corresponding value in the team column is equal to A:

from pyspark.sql import functions as F

#calculate mean value in points column for rows where team column is equal to 'A'
df.filter(df.team == 'A').agg(F.mean('points').alias('mean_points')).show()

+-----------+
|mean_points|
+-----------+
|      15.75|
+-----------+

We can see that the mean value in the points column among players on team A is 15.75.

Note: We used the alias function to rename the column in the resulting DataFrame to mean_points.

Example 2: Calculate Conditional Mean for Numeric Variable

We can use the following syntax to calculate the mean value in the assists column only for the rows where the corresponding value in the points column is greater than 10:

from pyspark.sql import functions as F

#calculate mean value in assists column for rows where points is greater than 10
df.filter(df.points > 10).agg(F.mean('assists').alias('mean_assists')).show()

+-----------------+
|     mean_assists|
+-----------------+
|6.333333333333333|
+-----------------+

We can see that the mean value in the assists column among players who scored more than 10 points is approximately 6.33.
