How can I group a PySpark DataFrame by date?

To group a PySpark DataFrame by date, pass a date column (or a timestamp column cast to a date) to the groupBy() function as the grouping key. This groups the rows based on the unique dates in that column, and groupBy() can then be combined with aggregation functions such as count() or sum() to summarize each group. This approach is useful for organizing and analyzing time-series data in a PySpark DataFrame.
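
For example, assuming a DataFrame named df with a timestamp column ts and a numeric sales column (placeholder names used purely for illustration), a minimal sketch of this pattern using the to_date() function looks like this:

from pyspark.sql import functions as F

#minimal sketch: group by the calendar date of the ts timestamp column
#and aggregate the sales column (df, ts and sales are placeholder names)
df.groupBy(F.to_date('ts').alias('date')) \
  .agg(F.sum('sales').alias('sum_sales')) \
  .show()

Casting the timestamp column to DateType(), as the rest of this article does, is an equivalent way to produce the same date grouping key.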

Group by Date in PySpark DataFrame


You can use the following syntax to group rows by date in a PySpark DataFrame:

from pyspark.sql.types import DateType
from pyspark.sql import functions as F

#calculate sum of sales by date
df.groupBy(df['ts'].cast(DateType()).alias('date')) \
  .agg(F.sum('sales').alias('sum_sales')) \
  .show()

This particular example groups the rows of the DataFrame by date based on the values in the ts column and then calculates the sum of the values in the sales column by date.

The following example shows how to use this syntax in practice.

Example: How to Group by Date in PySpark

Suppose we have the following PySpark DataFrame that contains information about the sales made on various timestamps at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql import functions as F

#define data
data = [['2023-01-15 04:14:22', 20],
        ['2023-01-15 10:55:01', 30],
        ['2023-01-15 18:34:59', 15],
        ['2023-01-16 21:20:25', 12],
        ['2023-01-16 22:20:05', 15],
        ['2023-01-17 04:17:02', 41]]

#define column names
columns = ['ts', 'sales']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#convert string column to timestamp
df = df.withColumn('ts', F.to_timestamp('ts', 'yyyy-MM-dd HH:mm:ss'))

#view dataframe
df.show()

+-------------------+-----+
|                 ts|sales|
+-------------------+-----+
|2023-01-15 04:14:22|   20|
|2023-01-15 10:55:01|   30|
|2023-01-15 18:34:59|   15|
|2023-01-16 21:20:25|   12|
|2023-01-16 22:20:05|   15|
|2023-01-17 04:17:02|   41|
+-------------------+-----+
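
If you'd like to confirm that the ts column is now a true timestamp rather than a string, you can inspect the schema:

#view the schema to confirm ts is a timestamp column
df.printSchema()

root
 |-- ts: timestamp (nullable = true)
 |-- sales: long (nullable = true)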

Suppose we would like to calculate the sum of the sales, grouped by date.

We can use the following syntax to do so:

from pyspark.sql.types import DateType
from pyspark.sql import functions as F

#calculate sum of sales by date
df.groupBy(df['ts'].cast(DateType()).alias('date')) \
  .agg(F.sum('sales').alias('sum_sales')) \
  .show()

+----------+---------+
|      date|sum_sales|
+----------+---------+
|2023-01-15|       65|
|2023-01-16|       27|
|2023-01-17|       41|
+----------+---------+

The resulting DataFrame shows the sum of sales by date.

For example, we can see:

  • The sum of sales for 2023-01-15 is 65.
  • The sum of sales for 2023-01-16 is 27.
  • The sum of sales for 2023-01-17 is 41.
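
Keep in mind that groupBy() does not guarantee the order of the output rows, so if you want the results sorted chronologically you can add an orderBy() call. For example, extending the sum example from above:

from pyspark.sql.types import DateType
from pyspark.sql import functions as F

#calculate sum of sales by date and sort the results by date
df.groupBy(df['ts'].cast(DateType()).alias('date')) \
  .agg(F.sum('sales').alias('sum_sales')) \
  .orderBy('date') \
  .show()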

Note that you can also aggregate the sales values using a different metric if you’d like.

For example, you could use the following syntax to calculate the count of sales, grouped by date:

from pyspark.sql.types import DateType
from pyspark.sql import functions as F

#calculate count of sales by date
df.groupBy(df['ts'].cast(DateType()).alias('date')) \
  .agg(F.count('sales').alias('count_sales')) \
  .show()

+----------+-----------+
|      date|count_sales|
+----------+-----------+
|2023-01-15|          3|
|2023-01-16|          2|
|2023-01-17|          1|
+----------+-----------+

The resulting DataFrame now shows the count of sales by date.

In both examples, we used the agg function to calculate an aggregate metric, grouped by the unique dates.
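
Since agg() accepts several aggregation expressions at once, you can also compute multiple metrics per date in a single pass. For example, the following sketch calculates the sum, count, and average of sales by date using the same DataFrame:

from pyspark.sql.types import DateType
from pyspark.sql import functions as F

#calculate the sum, count, and average of sales by date in one pass
df.groupBy(df['ts'].cast(DateType()).alias('date')) \
  .agg(F.sum('sales').alias('sum_sales'),
       F.count('sales').alias('count_sales'),
       F.avg('sales').alias('avg_sales')) \
  .show()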
