How can I calculate a cumulative sum in PySpark?

PySpark is a powerful open-source framework for large-scale data processing in Python. It offers a variety of built-in functions for manipulating and analyzing data, including window functions. A cumulative sum (also called a running total) adds each value in a column to the sum of all the values that came before it. PySpark DataFrames do not have a dedicated "cumsum" method; instead, you calculate a cumulative sum by defining a Window ordered by a column and applying the sum function over that window. This technique is particularly useful in data analysis and scales well to large datasets.

Calculate a Cumulative Sum in PySpark


You can use the following methods to calculate a cumulative sum in a PySpark DataFrame:

Method 1: Calculate Cumulative Sum of One Column

from pyspark.sql import Window
from pyspark.sql import functions as F

#define window for calculating cumulative sum
my_window = (Window.orderBy('day')
             .rowsBetween(Window.unboundedPreceding, 0))

#create new DataFrame that contains cumulative sales column
df_new = df.withColumn('cum_sales', F.sum('sales').over(my_window))
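
Note: Because the window in Method 1 has no partitionBy clause, Spark moves all rows into a single partition to compute the running total, so it may log a performance warning on large DataFrames. Method 2 avoids this by partitioning on the store column.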

Method 2: Calculate Cumulative Sum of One Column, Grouped by Another Column

from pyspark.sql import Window
from pyspark.sql import functions as F

#define window for calculating cumulative sum
my_window = (Window.partitionBy('store').orderBy('day')
             .rowsBetween(Window.unboundedPreceding, 0))

#create new DataFrame that contains cumulative sales, grouped by store
df_new = df.withColumn('cum_sales', F.sum('sales').over(my_window))
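
In both methods, the frame bound 0 refers to the current row. If you prefer the named constant, the same window can be written as shown below (the behavior is identical):

from pyspark.sql import Window

#same frame as above, written with the named constant instead of 0
my_window = (Window.partitionBy('store').orderBy('day')
             .rowsBetween(Window.unboundedPreceding, Window.currentRow))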

The following examples show how to use each method in practice.

Example 1: Calculate Cumulative Sum of One Column

Suppose we have the following PySpark DataFrame that contains information about the sales made during 10 consecutive days at a grocery store:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[1, 11], 
        [2, 8], 
        [3, 4], 
        [4, 5], 
        [5, 5], 
        [6, 8],
        [7, 7],
        [8, 7],
        [9, 6],
        [10, 4]] 
  
#define column names
columns = ['day', 'sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+---+-----+
|day|sales|
+---+-----+
|  1|   11|
|  2|    8|
|  3|    4|
|  4|    5|
|  5|    5|
|  6|    8|
|  7|    7|
|  8|    7|
|  9|    6|
| 10|    4|
+---+-----+

We can use the following syntax to calculate the cumulative sum of the values in the sales column:

from pyspark.sql import Window
from pyspark.sql import functions as F

#define window for calculating cumulative sum
my_window = (Window.orderBy('day')
             .rowsBetween(Window.unboundedPreceding, 0))

#create new DataFrame that contains cumulative sales column
df_new = df.withColumn('cum_sales', F.sum('sales').over(my_window))

#view new DataFrame
df_new.show()

+---+-----+---------+
|day|sales|cum_sales|
+---+-----+---------+
|  1|   11|       11|
|  2|    8|       19|
|  3|    4|       23|
|  4|    5|       28|
|  5|    5|       33|
|  6|    8|       41|
|  7|    7|       48|
|  8|    7|       55|
|  9|    6|       61|
| 10|    4|       65|
+---+-----+---------+

The resulting DataFrame contains a new column called cum_sales that shows the cumulative sum of the values in the sales column.
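
If you prefer Spark SQL, the same running total can be expressed with a SQL window function. Here is a minimal sketch, assuming the DataFrame has been registered under the (arbitrary) view name sales_data:

#register the DataFrame as a temporary view (the view name is just for illustration)
df.createOrReplaceTempView('sales_data')

#equivalent cumulative sum using a SQL window function
df_sql = spark.sql("""
    SELECT day, sales,
           SUM(sales) OVER (ORDER BY day
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_sales
    FROM sales_data
""")

df_sql.show()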

Example 2: Calculate Cumulative Sum of One Column, Grouped by Another Column

Suppose we have the following PySpark DataFrame that contains information about the daily sales made at two different grocery stores:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 1, 11], 
        ['A', 2, 8], 
        ['A', 3, 4], 
        ['A', 4, 5], 
        ['A', 5, 5], 
        ['B', 6, 8],
        ['B', 7, 7],
        ['B', 8, 7],
        ['B', 9, 6],
        ['B', 10, 4]] 
  
#define column names
columns = ['store', 'day', 'sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+---+-----+
|store|day|sales|
+-----+---+-----+
|    A|  1|   11|
|    A|  2|    8|
|    A|  3|    4|
|    A|  4|    5|
|    A|  5|    5|
|    B|  6|    8|
|    B|  7|    7|
|    B|  8|    7|
|    B|  9|    6|
|    B| 10|    4|
+-----+---+-----+

We can use the following syntax to calculate the cumulative sum of the values in the sales column, grouped by the values in the store column:

from pyspark.sql import Window
from pyspark.sql import functions as F

#define window for calculating cumulative sum
my_window = (Window.partitionBy('store').orderBy('day')
             .rowsBetween(Window.unboundedPreceding, 0))

#create new DataFrame that contains cumulative sales, grouped by store
df_new = df.withColumn('cum_sales', F.sum('sales').over(my_window))

#view new DataFrame
df_new.show()

+-----+---+-----+---------+
|store|day|sales|cum_sales|
+-----+---+-----+---------+
|    A|  1|   11|       11|
|    A|  2|    8|       19|
|    A|  3|    4|       23|
|    A|  4|    5|       28|
|    A|  5|    5|       33|
|    B|  6|    8|        8|
|    B|  7|    7|       15|
|    B|  8|    7|       22|
|    B|  9|    6|       28|
|    B| 10|    4|       32|
+-----+---+-----+---------+
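
The resulting DataFrame contains a new column called cum_sales that shows the cumulative sum of the values in the sales column, calculated separately for each store. Notice that the running total resets to 8 at the first row for store B. As a quick sanity check, the final cum_sales value for each store (33 for store A and 32 for store B) should match each store's total sales, which you can verify with a simple aggregation:

#total sales per store; these should match the final cum_sales value for each store
df_new.groupBy('store').agg(F.sum('sales').alias('total_sales')).show()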
