How do I calculate the sum of a column in PySpark?

To calculate the sum of a column in a PySpark DataFrame, pass the column to the “sum” function from pyspark.sql.functions inside an aggregation such as agg or select. The aggregation returns a one-row DataFrame holding the total, which you can either display or collect into a plain Python value. This is useful for data analysis tasks such as finding the total revenue or expenses in a dataset.

Calculate the Sum of a Column in PySpark


You can use the following methods to calculate the sum of a column in a PySpark DataFrame:

Method 1: Calculate Sum for One Specific Column

from pyspark.sql import functions as F

#calculate sum of column named 'game1'
df.agg(F.sum('game1')).collect()[0][0]

Method 2: Calculate Sum for Multiple Columns

#import the functions module as F so that Python's built-in sum is not shadowed
from pyspark.sql import functions as F

#calculate sum for game1, game2 and game3 columns
df.select(F.sum(df.game1), F.sum(df.game2), F.sum(df.game3)).show()

The examples below show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 25, 11, 10], 
        ['Nets', 22, 8, 14], 
        ['Hawks', 14, 22, 10], 
        ['Kings', 30, 22, 35], 
        ['Bulls', 15, 14, 12], 
        ['Blazers', 10, 14, 18]] 
  
#define column names
columns = ['team', 'game1', 'game2', 'game3'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+-----+-----+-----+
|   team|game1|game2|game3|
+-------+-----+-----+-----+
|   Mavs|   25|   11|   10|
|   Nets|   22|    8|   14|
|  Hawks|   14|   22|   10|
|  Kings|   30|   22|   35|
|  Bulls|   15|   14|   12|
|Blazers|   10|   14|   18|
+-------+-----+-----+-----+

Example 1: Calculate Sum for One Specific Column

We can use the following syntax to calculate the sum of values in the game1 column of the DataFrame only:

from pyspark.sql import functions as F

#calculate sum of column named 'game1'
df.agg(F.sum('game1')).collect()[0][0]

116

The sum of values in the game1 column turns out to be 116.

We can verify this is correct by manually calculating the sum of values in this column:

Sum of values in game1: 25 + 22 + 14 + 30 + 15 + 10 = 116

Example 2: Calculate Sum for Multiple Columns

We can use the following syntax to calculate the sum of values for the game1, game2 and game3 columns of the DataFrame:

#import the functions module as F so that Python's built-in sum is not shadowed
from pyspark.sql import functions as F

#calculate sum for game1, game2 and game3 columns
df.select(F.sum(df.game1), F.sum(df.game2), F.sum(df.game3)).show()

+----------+----------+----------+
|sum(game1)|sum(game2)|sum(game3)|
+----------+----------+----------+
|       116|        91|        99|
+----------+----------+----------+
  • The sum of values in the game1 column is 116.
  • The sum of values in the game2 column is 91.
  • The sum of values in the game3 column is 99.

Note: The sum function ignores null values in the column. If every value in the column is null, the result is null rather than 0.
