How can I round a date in PySpark to the first day of the month?

To round a date in PySpark to the first day of the month, you can use the “date_trunc” function and specify the unit as “month”. This will truncate the given date to the first day of the month, effectively rounding it down to the nearest month. This method is useful for data analysis and reporting purposes, as it allows for a more organized and consistent representation of dates within a dataset.

PySpark: Round Date to First Day of Month


You can use the following syntax to round dates to the first day of the month in a PySpark DataFrame:

import pyspark.sql.functions as F

#add new column that rounds date to first day of month
df_new = df.withColumn('first_day_of_month', F.trunc('date', 'month'))

This particular example creates a new column named first_day_of_month that rounds each date in the date column to the first day of the month.

The following example shows how to use this syntax in practice.

Example: How to Round Date to First Day of Month in PySpark

Suppose we have the following PySpark DataFrame that contains information about the sales made on various days at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['2023-04-11', 22],
        ['2023-04-15', 14],
        ['2023-04-17', 12],
        ['2023-05-21', 15],
        ['2023-05-23', 30],
        ['2023-10-26', 45],
        ['2023-10-28', 32],
        ['2023-10-29', 47]]
  
#define column names
columns = ['date', 'sales']
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----------+-----+
|      date|sales|
+----------+-----+
|2023-04-11|   22|
|2023-04-15|   14|
|2023-04-17|   12|
|2023-05-21|   15|
|2023-05-23|   30|
|2023-10-26|   45|
|2023-10-28|   32|
|2023-10-29|   47|
+----------+-----+

Suppose we would like to round each date in the date column to the first day of the month.

We can use the following syntax to do so:

import pyspark.sql.functions as F

#add new column that rounds date to first day of month
df_new = df.withColumn('first_day_of_month', F.trunc('date', 'month'))

#view new DataFrame
df_new.show()

+----------+-----+------------------+
|      date|sales|first_day_of_month|
+----------+-----+------------------+
|2023-04-11|   22|        2023-04-01|
|2023-04-15|   14|        2023-04-01|
|2023-04-17|   12|        2023-04-01|
|2023-05-21|   15|        2023-05-01|
|2023-05-23|   30|        2023-05-01|
|2023-10-26|   45|        2023-10-01|
|2023-10-28|   32|        2023-10-01|
|2023-10-29|   47|        2023-10-01|
+----------+-----+------------------+

The new first_day_of_month column contains each date from the date column rounded to the first day of the month.

For example, we can see:

  • The date 2023-04-11 has been rounded to 2023-04-01.
  • The date 2023-04-15 has been rounded to 2023-04-01.
  • The date 2023-04-17 has been rounded to 2023-04-01.

And so on.

Note: You can find the complete documentation for the PySpark trunc function .

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

x