How can I calculate the standard deviation in PySpark?

Calculating the standard deviation in PySpark is done with the aggregate functions in the pyspark.sql.functions module. The stddev function (an alias for stddev_samp) computes the sample standard deviation of a column, while stddev_pop computes the population standard deviation. You apply these functions to one or more columns through a DataFrame's agg or select methods, and because they run as distributed aggregations, they scale efficiently to large datasets. The DataFrame.summary() method also reports the standard deviation alongside the count, mean, and other summary statistics.

Calculate Standard Deviation in PySpark


You can use the following methods to calculate the standard deviation of a column in a PySpark DataFrame:

Method 1: Calculate Standard Deviation for One Specific Column

from pyspark.sql import functions as F

#calculate standard deviation of values in 'game1' column
df.agg(F.stddev('game1')).collect()[0][0]

Method 2: Calculate Standard Deviation for Multiple Columns

from pyspark.sql.functions import stddev

#calculate standard deviation for game1, game2 and game3 columns
df.select(stddev(df.game1), stddev(df.game2), stddev(df.game3)).show()

Note: The stddev function uses the sample standard deviation formula to calculate the standard deviation.

If you would instead like to use the population standard deviation formula, then use the stddev_pop function instead.
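For instance, here is a minimal sketch of the population version, using the same df created in the examples below:

from pyspark.sql.functions import stddev_pop

#calculate population standard deviation of values in 'game1' column
df.agg(stddev_pop('game1')).collect()[0][0]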

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 25, 11, 10], 
        ['Nets', 22, 8, 14], 
        ['Hawks', 14, 22, 10], 
        ['Kings', 30, 22, 35], 
        ['Bulls', 15, 14, 12], 
        ['Blazers', 10, 14, 18]] 
  
#define column names
columns = ['team', 'game1', 'game2', 'game3'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+-----+-----+-----+
|   team|game1|game2|game3|
+-------+-----+-----+-----+
|   Mavs|   25|   11|   10|
|   Nets|   22|    8|   14|
|  Hawks|   14|   22|   10|
|  Kings|   30|   22|   35|
|  Bulls|   15|   14|   12|
|Blazers|   10|   14|   18|
+-------+-----+-----+-----+

Example 1: Calculate Standard Deviation for One Specific Column

We can use the following syntax to calculate the standard deviation of values in the game1 column of the DataFrame only:

from pyspark.sql import functions as F

#calculate standard deviation of values in 'game1' column
df.agg(F.stddev('game1')).collect()[0][0]

7.5806771905065755

The standard deviation of values in the game1 column turns out to be 7.5807.
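As a quick sanity check, the same value can be reproduced locally with Python's built-in statistics module, whose stdev function also uses the sample formula:

import statistics

#sample standard deviation of the game1 values, computed locally
statistics.stdev([25, 22, 14, 30, 15, 10])

This returns the same value of roughly 7.5807.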

Example 2: Calculate Standard Deviation for Multiple Columns

We can use the following syntax to calculate the standard deviation of values for the game1, game2 and game3 columns of the DataFrame:

from pyspark.sql.functions import stddev

#calculate standard deviation for game1, game2 and game3 columns
df.select(stddev(df.game1), stddev(df.game2), stddev(df.game3)).show()

+------------------+------------------+------------------+
|stddev_samp(game1)|stddev_samp(game2)|stddev_samp(game3)|
+------------------+------------------+------------------+
|7.5806771905065755| 5.741660619251774| 9.544631999192006|
+------------------+------------------+------------------+

From the output we can see:

  • The standard deviation of values in the game1 column is 7.5807.
  • The standard deviation of values in the game2 column is 5.7417.
  • The standard deviation of values in the game3 column is 9.5446.
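If you need the standard deviation of many columns, you can also build the expressions programmatically instead of typing each one out; a minimal sketch using a list comprehension:

from pyspark.sql.functions import stddev

#build one stddev expression per column of interest
game_cols = ['game1', 'game2', 'game3']

df.select([stddev(c) for c in game_cols]).show()

This produces the same output as Example 2.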

Note: If there are null values in the column, the stddev function will ignore these values by default.
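For example, here is a minimal sketch with a hypothetical 'points' column that contains a null value:

from pyspark.sql import functions as F

#DataFrame with one null value in the 'points' column
df_nulls = spark.createDataFrame([(10,), (20,), (None,)], ['points'])

#the null row is skipped, so this is the sample standard deviation of 10 and 20
df_nulls.agg(F.stddev('points')).collect()[0][0]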
