How to group by Date in PySpark DataFrame

Grouping by date in a PySpark DataFrame can be done by using the groupBy function (or a window function) to collect rows into groups based on a date column. You can then use the agg function to perform operations on each group of data, such as computing the sum, average, or count of the values within each group. This is a powerful way to analyze and visualize data in PySpark.


You can use the following syntax to group rows by date in a PySpark DataFrame:

from pyspark.sql.functions import sum
from pyspark.sql.types import DateType

#calculate sum of sales by date
df.groupBy(df['ts'].cast(DateType()).alias('date')) \
  .agg(sum('sales').alias('sum_sales')).show()

This particular example casts the values in the ts column to dates, groups the rows of the DataFrame by those dates, and then calculates the sum of the values in the sales column for each date.
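Note that casting the ts column to DateType is equivalent to truncating the timestamps to dates with the to_date function. If you prefer that style, the following sketch (using the same ts and sales column names) should produce the same result:

from pyspark.sql import functions as F

#equivalent approach: truncate the timestamps to dates with to_date
df.groupBy(F.to_date('ts').alias('date')) \
  .agg(F.sum('sales').alias('sum_sales')).show()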

The following example shows how to use this syntax in practice.

Example: How to Group by Date in PySpark

Suppose we have the following PySpark DataFrame that contains information about the sales made on various timestamps at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql import functions as F

#define data
data = [['2023-01-15 04:14:22', 20],
        ['2023-01-15 10:55:01', 30],
        ['2023-01-15 18:34:59', 15],
        ['2023-01-16 21:20:25', 12],
        ['2023-01-16 22:20:05', 15],
        ['2023-01-17 04:17:02', 41]]

#define column names
columns = ['ts', 'sales']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#convert string column to timestamp
df = df.withColumn('ts', F.to_timestamp('ts', 'yyyy-MM-dd HH:mm:ss'))

#view dataframe
df.show()

+-------------------+-----+
|                 ts|sales|
+-------------------+-----+
|2023-01-15 04:14:22|   20|
|2023-01-15 10:55:01|   30|
|2023-01-15 18:34:59|   15|
|2023-01-16 21:20:25|   12|
|2023-01-16 22:20:05|   15|
|2023-01-17 04:17:02|   41|
+-------------------+-----+
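If you want to confirm that the ts column was actually converted from a string to a timestamp, you can inspect the schema. With the data above, the output should look something like this:

#verify the column types
df.printSchema()

root
 |-- ts: timestamp (nullable = true)
 |-- sales: long (nullable = true)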

Suppose we would like to calculate the sum of the sales, grouped by date.

We can use the following syntax to do so:

from pyspark.sql.functions import sum
from pyspark.sql.types import DateType

#calculate sum of sales by date
df.groupBy(df['ts'].cast(DateType()).alias('date')) \
  .agg(sum('sales').alias('sum_sales')).show()

+----------+---------+
|      date|sum_sales|
+----------+---------+
|2023-01-15|       65|
|2023-01-16|       27|
|2023-01-17|       41|
+----------+---------+

The resulting DataFrame shows the sum of sales by date.

For example, we can see:

  • The sum of sales for 2023-01-15 is 65.
  • The sum of sales for 2023-01-16 is 27.
  • The sum of sales for 2023-01-17 is 41.
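Note that groupBy does not guarantee any particular ordering of the output rows. If you want the results sorted chronologically, you can add an orderBy call, for example:

from pyspark.sql.functions import sum
from pyspark.sql.types import DateType

#calculate sum of sales by date, sorted by date
df.groupBy(df['ts'].cast(DateType()).alias('date')) \
  .agg(sum('sales').alias('sum_sales')) \
  .orderBy('date').show()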

Note that you can also aggregate sales by a different metric if you’d like.

For example, you could use the following syntax to calculate the count of sales, grouped by date:

from pyspark.sql.functions import count
from pyspark.sql.types import DateType

#calculate count of sales by date
df.groupBy(df['ts'].cast(DateType()).alias('date')) \
  .agg(count('sales').alias('count_sales')).show()

+----------+-----------+
|      date|count_sales|
+----------+-----------+
|2023-01-15|          3|
|2023-01-16|          2|
|2023-01-17|          1|
+----------+-----------+

The resulting DataFrame now shows the count of sales by date.
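You can also calculate several metrics in a single pass by passing multiple aggregations to the agg function. For example, the following should return both the sum and the count of sales for each date:

from pyspark.sql import functions as F
from pyspark.sql.types import DateType

#calculate sum and count of sales by date in one aggregation
df.groupBy(df['ts'].cast(DateType()).alias('date')) \
  .agg(F.sum('sales').alias('sum_sales'),
       F.count('sales').alias('count_sales')).show()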

In each of these examples, we used the agg function to calculate aggregate metrics for the rows grouped by the unique dates in the ts column.
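As mentioned earlier, you can also group a timestamp column with the window function instead of casting it to a date. The sketch below groups the rows into one-day tumbling windows; note that the result contains a window struct with start and end timestamps rather than a plain date column, and the window boundaries depend on your Spark session time zone:

from pyspark.sql import functions as F

#group rows into one-day tumbling windows based on the ts column
df.groupBy(F.window('ts', '1 day')) \
  .agg(F.sum('sales').alias('sum_sales')).show(truncate=False)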
