To calculate the sum of each row in a PySpark DataFrame, add the column expressions together and store the result in a new column using the DataFrame.withColumn() method. This returns a new DataFrame that contains the original columns plus one column holding the sum of the values in each row. You can include or exclude columns from the calculation by controlling which column names go into the sum.

You can use the following syntax to calculate the sum of values in each row of a PySpark DataFrame:

from pyspark.sql import functions as F

#add new column that contains sum of each row
df_new = df.withColumn('row_sum', sum([F.col(c) for c in df.columns]))

This particular example creates a new column named row_sum that contains the sum of values in each row.

The following example shows how to use this syntax in practice.

Example: How to Calculate Sum of Each Row in PySpark

Suppose we have the following PySpark DataFrame that shows the number of points scored in three different games by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[14, 16, 10], 
        [12, 10, 13], 
        [8, 10, 20], 
        [15, 15, 15], 
        [19, 3, 15],
        [24, 40, 23],
        [15, 12, 19],
        [10, 10, 16]]
  
#define column names
columns = ['game1', 'game2', 'game3'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+-----+-----+
|game1|game2|game3|
+-----+-----+-----+
|   14|   16|   10|
|   12|   10|   13|
|    8|   10|   20|
|   15|   15|   15|
|   19|    3|   15|
|   24|   40|   23|
|   15|   12|   19|
|   10|   10|   16|
+-----+-----+-----+

We can use the following syntax to create a new column named row_sum that contains the sum of the values in each row:

from pyspark.sql import functions as F

#add new column that contains sum of each row
df_new = df.withColumn('row_sum', sum([F.col(c) for c in df.columns]))

#view new DataFrame
df_new.show()

+-----+-----+-----+-------+
|game1|game2|game3|row_sum|
+-----+-----+-----+-------+
|   14|   16|   10|     40|
|   12|   10|   13|     35|
|    8|   10|   20|     38|
|   15|   15|   15|     45|
|   19|    3|   15|     37|
|   24|   40|   23|     87|
|   15|   12|   19|     46|
|   10|   10|   16|     36|
+-----+-----+-----+-------+

The new column named row_sum contains the sum of the values in each row.

For example:

The sum of values in the first row is 14 + 16 + 10 = 40.
The sum of values in the second row is 12 + 10 + 13 = 35.
The sum of values in the third row is 8 + 10 + 20 = 38.

And so on.

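Note: If any value in a row is null, the row sum built this way is also null, because adding null to a number in Spark returns null. (It is the aggregate function F.sum, which sums down a column, that ignores nulls.) To treat nulls as zero, wrap each column in F.coalesce(F.col(c), F.lit(0)) before summing.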
