How can I add months to a date column in PySpark?

You can add months to a date column in PySpark with the built-in add_months function from pyspark.sql.functions, which shifts each date by a specified number of months. (The similarly named date_add function adds days, not months.) Combined with withColumn, add_months lets you create a new column of adjusted dates or overwrite an existing date column, which is useful for tasks such as forecasting and trend analysis.

PySpark: Add Months to a Date Column


You can use the following syntax to add a specific number of months to a date column in a PySpark DataFrame:

from pyspark.sql import functions as F

df.withColumn('add5months', F.add_months(df['date'], 5)).show()

This particular example creates a new column called add5months that contains each date in the date column with 5 months added.

The following example shows how to use this syntax in practice.

Example: How to Add Months to a Date Column in PySpark

Suppose we have the following PySpark DataFrame that contains information about sales made on various dates at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['2023-01-15', 225],
        ['2023-02-24', 260],
        ['2023-07-14', 413],
        ['2023-10-30', 368],
        ['2023-11-03', 322],
        ['2023-11-26', 278]] 
  
#define column names
columns = ['date', 'sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----------+-----+
|      date|sales|
+----------+-----+
|2023-01-15|  225|
|2023-02-24|  260|
|2023-07-14|  413|
|2023-10-30|  368|
|2023-11-03|  322|
|2023-11-26|  278|
+----------+-----+

Suppose we would like to add a new column that adds 5 months to each date in the date column.

We can use the following syntax to do so:

from pyspark.sql import functions as F

#add 5 months to each date in 'date' column
df.withColumn('add5months', F.add_months(df['date'], 5)).show()

+----------+-----+----------+
|      date|sales|add5months|
+----------+-----+----------+
|2023-01-15|  225|2023-06-15|
|2023-02-24|  260|2023-07-24|
|2023-07-14|  413|2023-12-14|
|2023-10-30|  368|2024-03-30|
|2023-11-03|  322|2024-04-03|
|2023-11-26|  278|2024-04-26|
+----------+-----+----------+

Notice that the new add5months column contains each of the dates from the date column with five months added.

To subtract 5 months instead, pass a value of -5 to the add_months() function:

from pyspark.sql import functions as F

#subtract 5 months from each date in 'date' column
df.withColumn('sub5months', F.add_months(df['date'], -5)).show()

+----------+-----+----------+
|      date|sales|sub5months|
+----------+-----+----------+
|2023-01-15|  225|2022-08-15|
|2023-02-24|  260|2022-09-24|
|2023-07-14|  413|2023-02-14|
|2023-10-30|  368|2023-05-30|
|2023-11-03|  322|2023-06-03|
|2023-11-26|  278|2023-06-26|
+----------+-----+----------+

Notice that the new sub5months column contains each of the dates from the date column with five months subtracted.
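None of the dates above falls on a day that is missing from its target month, so the tables do not show what happens in a case such as January 31 plus one month. As far as we are aware, Spark 3.0 and later clamp the result to the last day of the target month in that situation. The following is a minimal pure-Python sketch of that arithmetic, not PySpark's actual implementation; the helper name add_months_sketch is hypothetical and is used here only for illustration.

```python
import calendar
from datetime import date

def add_months_sketch(d: date, months: int) -> date:
    """Hypothetical stand-in for month arithmetic; not part of PySpark."""
    # count months from year 0 so negative offsets roll the year correctly
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # clamp the day to the last day of the target month
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

print(add_months_sketch(date(2023, 1, 15), 5))    # 2023-06-15
print(add_months_sketch(date(2023, 1, 15), -5))   # 2022-08-15
print(add_months_sketch(date(2023, 1, 31), 1))    # 2023-02-28 (day clamped)
print(add_months_sketch(date(2024, 3, 31), -1))   # 2024-02-29 (leap year)
```

The first two results match the first row of the add5months and sub5months tables above. If month-end behavior matters for your data, it is worth confirming it against the Spark version you are running, since it has changed between major releases.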

Note that we used the withColumn function to return a new DataFrame with the sub5months column added and all other columns left the same.

You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.
