How do you combine rows with the same column values in PySpark?

In PySpark, you can combine rows that share the same column values by using the groupBy() and agg() methods. The groupBy() method groups the DataFrame by the given column(s), and the agg() method performs aggregations such as sum, count, min, and max on each group. The result is a DataFrame with one row per distinct value of the grouping column(s), so it has at most as many rows as the original DataFrame.
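Conceptually, this group-then-aggregate step works like the following plain-Python sketch. This is only an illustration of the logic, not how Spark actually executes it (Spark distributes the work across partitions):

```python
from collections import defaultdict

#toy rows: (ID, employee, sales, returns)
rows = [(101, 'Dan', 4, 1),
        (101, 'Dan', 1, 2),
        (102, 'Ken', 3, 2)]

#step 1: group rows by the ID value (what groupBy does)
groups = defaultdict(list)
for row in rows:
    groups[row[0]].append(row)

#step 2: aggregate each group (what agg does) - keep the
#first employee name and sum the two numeric columns
combined = {}
for id_, grp in groups.items():
    combined[id_] = (grp[0][1],                 #first employee name
                     sum(r[2] for r in grp),    #total sales
                     sum(r[3] for r in grp))    #total returns

#combined[101] is ('Dan', 5, 3)
```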


You can use the following syntax to combine rows with the same column values in a PySpark DataFrame:

#import the column functions used below (this sum is PySpark's, not the builtin)
from pyspark.sql.functions import first, sum

#create new DataFrame by combining rows with same ID values
df_new = df.groupBy('ID').agg(first('employee').alias('employee'),
                              sum('sales').alias('sum_sales'),
                              sum('returns').alias('sum_returns'))

This particular example combines the rows that share the same value in the ID column, keeps the first employee name in each group, and calculates the sum of the values in the sales and returns columns.

The following example shows how to use this syntax in practice.

Example: Combine Rows with Same Column Values in PySpark

Suppose we have the following PySpark DataFrame that contains information about sales and returns for various salesmen at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[101, 'Dan', 4, 1],
        [101, 'Dan', 1, 2],
        [102, 'Ken', 3, 2],
        [103, 'Rick', 2, 1],
        [103, 'Rick', 5, 3],
        [103, 'Rick', 3, 2]]

#define column names
columns = ['ID', 'employee', 'sales', 'returns']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+---+--------+-----+-------+
| ID|employee|sales|returns|
+---+--------+-----+-------+
|101|     Dan|    4|      1|
|101|     Dan|    1|      2|
|102|     Ken|    3|      2|
|103|    Rick|    2|      1|
|103|    Rick|    5|      3|
|103|    Rick|    3|      2|
+---+--------+-----+-------+

Suppose we would like to combine the rows with the same values in the ID column and then aggregate the values in the remaining columns.

We can use the following syntax to do so:

#import the column functions used below (this sum is PySpark's, not the builtin)
from pyspark.sql.functions import first, sum

#create new DataFrame by combining rows with same ID values
df_new = df.groupBy('ID').agg(first('employee').alias('employee'),
                              sum('sales').alias('sum_sales'),
                              sum('returns').alias('sum_returns'))

#view new DataFrame
df_new.show()

+---+--------+---------+-----------+
| ID|employee|sum_sales|sum_returns|
+---+--------+---------+-----------+
|101|     Dan|        5|          3|
|102|     Ken|        3|          2|
|103|    Rick|       10|          6|
+---+--------+---------+-----------+

The resulting DataFrame contains one row for each distinct value in the ID column, showing the employee name, the sum of the values from the sales column, and the sum of the values from the returns column.

Note #1: We used the first function to return the first employee name associated with each ID, since the name is the same within each group.

Note #2: We used the alias function to specify the names to use for the columns in the resulting DataFrame.
