How can I calculate the sum of each row in a PySpark DataFrame?

To calculate the sum of each row in a PySpark DataFrame, you can combine Python's built-in sum function with pyspark.sql.functions.col and the DataFrame's withColumn method. Building a list of Column objects (one per column to be summed) and passing it to sum chains the + operator across those columns, yielding a single expression that adds the values in each row. withColumn then attaches the result to the DataFrame as a new column.

PySpark: Calculate Sum of Each Row in DataFrame


You can use the following syntax to calculate the sum of values in each row of a PySpark DataFrame:

from pyspark.sql import functions as F

#add new column that contains sum of each row
df_new = df.withColumn('row_sum', sum([F.col(c) for c in df.columns]))

This particular example creates a new column named row_sum that contains the sum of values in each row.
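
Note that sum here is Python's built-in sum, not pyspark.sql.functions.sum: applied to a list of Column objects, it simply chains the + operator across them. For a hypothetical DataFrame with columns col1, col2, and col3, the expression expands to the following:

from pyspark.sql import functions as F

#built-in sum over Columns is just chained addition
#(col1, col2, col3 are hypothetical column names)
df_new = df.withColumn('row_sum', F.col('col1') + F.col('col2') + F.col('col3'))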

The following example shows how to use this syntax in practice.

Example: How to Calculate Sum of Each Row in PySpark

Suppose we have the following PySpark DataFrame that shows the number of points scored in three different games by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[14, 16, 10], 
        [12, 10, 13], 
        [8, 10, 20], 
        [15, 15, 15], 
        [19, 3, 15],
        [24, 40, 23],
        [15, 12, 19],
        [10, 10, 16]]
  
#define column names
columns = ['game1', 'game2', 'game3'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+-----+-----+
|game1|game2|game3|
+-----+-----+-----+
|   14|   16|   10|
|   12|   10|   13|
|    8|   10|   20|
|   15|   15|   15|
|   19|    3|   15|
|   24|   40|   23|
|   15|   12|   19|
|   10|   10|   16|
+-----+-----+-----+

We can use the following syntax to create a new column named row_sum that contains the sum of the values in each row:

from pyspark.sql import functions as F

#add new column that contains sum of each row
df_new = df.withColumn('row_sum', sum([F.col(c) for c in df.columns]))

#view new DataFrame
df_new.show()

+-----+-----+-----+-------+
|game1|game2|game3|row_sum|
+-----+-----+-----+-------+
|   14|   16|   10|     40|
|   12|   10|   13|     35|
|    8|   10|   20|     38|
|   15|   15|   15|     45|
|   19|    3|   15|     37|
|   24|   40|   23|     87|
|   15|   12|   19|     46|
|   10|   10|   16|     36|
+-----+-----+-----+-------+

The new column named row_sum contains the sum of the values in each row.

For example:

  • The sum of values in the first row is 14 + 16 + 10 = 40.
  • The sum of values in the second row is 12 + 10 + 13 = 35.
  • The sum of values in the third row is 8 + 10 + 20 = 38.

And so on.
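
By default this syntax sums every column in df.columns. If you only want certain columns in the total, pass just those names instead. For example, a sketch that sums only game1 and game2 from the DataFrame above (the column name partial_sum is just illustrative):

from pyspark.sql import functions as F

#sum only a subset of columns (here game1 and game2)
cols_to_sum = ['game1', 'game2']
df_subset = df.withColumn('partial_sum', sum([F.col(c) for c in cols_to_sum]))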

Note: Because this approach chains the + operator across the columns, a null value in any column will make row_sum null for that row; unlike the F.sum aggregate function, it does not ignore nulls.
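
If you would rather treat nulls as zero, one option (a minimal sketch, not part of the original syntax) is to wrap each column in coalesce before summing:

from pyspark.sql import functions as F

#replace nulls with 0 so they don't null out the row total
df_new = df.withColumn('row_sum', sum([F.coalesce(F.col(c), F.lit(0)) for c in df.columns]))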
