How can I combine rows with the same column values in PySpark?

PySpark: Combine Rows with Same Column Values


You can use the following syntax to combine rows with the same column values in a PySpark DataFrame:

from pyspark.sql.functions import *

#create new DataFrame by combining rows with same ID values
df_new = df.groupBy('ID').agg(first('employee').alias('employee'),
                              sum('sales').alias('sum_sales'),
                              sum('returns').alias('sum_returns'))

This particular example combines the rows in the DataFrame with the same value in the ID column and then calculates the sum of the values in the sales and returns columns.

The following example shows how to use this syntax in practice.

Example: Combine Rows with Same Column Values in PySpark

Suppose we have the following PySpark DataFrame that contains information about sales and returns for various salespeople at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[101, 'Dan', 4, 1],
        [101, 'Dan', 1, 2],
        [102, 'Ken', 3, 2],
        [103, 'Rick', 2, 1],
        [103, 'Rick', 5, 3],
        [103, 'Rick', 3, 2]]

#define column names
columns = ['ID', 'employee', 'sales', 'returns']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+---+--------+-----+-------+
| ID|employee|sales|returns|
+---+--------+-----+-------+
|101|     Dan|    4|      1|
|101|     Dan|    1|      2|
|102|     Ken|    3|      2|
|103|    Rick|    2|      1|
|103|    Rick|    5|      3|
|103|    Rick|    3|      2|
+---+--------+-----+-------+

Suppose we would like to combine the rows with the same values in the ID column and then aggregate the values in the remaining columns.

We can use the following syntax to do so:

from pyspark.sql.functions import *

#create new DataFrame by combining rows with same ID values
df_new = df.groupBy('ID').agg(first('employee').alias('employee'),
                              sum('sales').alias('sum_sales'),
                              sum('returns').alias('sum_returns'))

#view new DataFrame
df_new.show()

+---+--------+---------+-----------+
| ID|employee|sum_sales|sum_returns|
+---+--------+---------+-----------+
|101|     Dan|        5|          3|
|102|     Ken|        3|          2|
|103|    Rick|       10|          6|
+---+--------+---------+-----------+

The resulting DataFrame combines the rows that share the same value in the ID column and returns the name of the employee, the sum of the values from the sales column, and the sum of the values from the returns column.

Note #1: We used the first function to return the first employee name associated with each ID value.

Note #2: We used the alias function to specify the names to use for the columns in the resulting DataFrame.
