How do you calculate Standard Deviation in PySpark?

Standard deviation is a measure of how spread out a set of data is from its mean. In PySpark, the standard deviation of a DataFrame column can be calculated with the stddev function from the pyspark.sql.functions module. This function takes a column as an argument and returns the standard deviation of its values. It applies the standard deviation formula: calculate the mean, find the difference between each data point and the mean, square those differences, sum them, divide the sum by n - 1, and take the square root of the result. In a distributed PySpark environment, this computation runs on each partition of the data, and the partial results are combined to produce the final standard deviation.
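
To make the formula concrete, here is a minimal sketch in plain Python (not PySpark) that applies the same steps to the game1 values from the DataFrame created later in this tutorial; the variable names are illustrative only:

#game1 values from the example DataFrame below
data = [25, 22, 14, 30, 15, 10]

#calculate the mean
mean = sum(data) / len(data)

#sum the squared differences from the mean and divide by n-1 (sample formula)
sample_variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)

#take the square root of the variance to get the standard deviation
print(sample_variance ** 0.5)

7.5806771905065755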


You can use the following methods to calculate the standard deviation of a column in a PySpark DataFrame:

Method 1: Calculate Standard Deviation for One Specific Column

from pyspark.sql import functions as F

#calculate standard deviation of values in 'game1' column
df.agg(F.stddev('game1')).collect()[0][0]

Method 2: Calculate Standard Deviation for Multiple Columns

from pyspark.sql.functions import stddev

#calculate standard deviation for game1, game2 and game3 columns
df.select(stddev(df.game1), stddev(df.game2), stddev(df.game3)).show()

Note: The stddev function uses the sample standard deviation formula (which divides by n - 1) to calculate the standard deviation.

If you would rather use the population standard deviation formula (which divides by n), use the stddev_pop function instead.
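
For example, the following sketch (using the df DataFrame created below) shows the two functions side by side so you can compare their results:

from pyspark.sql.functions import stddev, stddev_pop

#compare sample and population standard deviation for the 'game1' column
df.select(stddev('game1'), stddev_pop('game1')).show()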

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 25, 11, 10], 
        ['Nets', 22, 8, 14], 
        ['Hawks', 14, 22, 10], 
        ['Kings', 30, 22, 35], 
        ['Bulls', 15, 14, 12], 
        ['Blazers', 10, 14, 18]] 
  
#define column names
columns = ['team', 'game1', 'game2', 'game3'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+-----+-----+-----+
|   team|game1|game2|game3|
+-------+-----+-----+-----+
|   Mavs|   25|   11|   10|
|   Nets|   22|    8|   14|
|  Hawks|   14|   22|   10|
|  Kings|   30|   22|   35|
|  Bulls|   15|   14|   12|
|Blazers|   10|   14|   18|
+-------+-----+-----+-----+

Example 1: Calculate Standard Deviation for One Specific Column

We can use the following syntax to calculate the standard deviation of values in only the game1 column of the DataFrame:

from pyspark.sql import functions as F

#calculate standard deviation of values in the 'game1' column
df.agg(F.stddev('game1')).collect()[0][0]

7.5806771905065755

The standard deviation of values in the game1 column turns out to be 7.5807.

Example 2: Calculate Standard Deviation for Multiple Columns

We can use the following syntax to calculate the standard deviation of values for the game1, game2 and game3 columns of the DataFrame:

from pyspark.sql.functions import stddev

#calculate standard deviation for game1, game2 and game3 columns
df.select(stddev(df.game1), stddev(df.game2), stddev(df.game3)).show()

+------------------+------------------+------------------+
|stddev_samp(game1)|stddev_samp(game2)|stddev_samp(game3)|
+------------------+------------------+------------------+
|7.5806771905065755| 5.741660619251774| 9.544631999192006|
+------------------+------------------+------------------+

From the output we can see:

  • The standard deviation of values in the game1 column is 7.5807.
  • The standard deviation of values in the game2 column is 5.7417.
  • The standard deviation of values in the game3 column is 9.5446.

Note: If there are null values in the column, the stddev function will ignore these values by default.
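
As a quick sketch of this behavior (the df_nulls DataFrame and its values are hypothetical, created here only for illustration):

from pyspark.sql import functions as F

#hypothetical DataFrame with a null value in the 'points' column
df_nulls = spark.createDataFrame([(10,), (20,), (None,)], ['points'])

#stddev ignores the null and computes over the two non-null values
df_nulls.agg(F.stddev('points')).collect()[0][0]

7.0710678118654755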
