PySpark: Add Years to a Date Column


You can use the following syntax to add a specific number of years to a date column in a PySpark DataFrame:

from pyspark.sql import functions as F

df.withColumn('add5years', F.add_months(df['date'], 12*5)).show()

This particular example creates a new column called add5years that adds 5 years to each date in the date column.

Note: There is no function in PySpark to directly add a specific number of years to a date, so instead we must use the add_months() function and simply multiply the number of years we’d like to add by 12.

The following example shows how to use this syntax in practice.

Example: How to Add Years to a Date Column in PySpark

Suppose we have the following PySpark DataFrame that contains information about sales made on various dates at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['2023-01-15', 225],
        ['2023-02-24', 260],
        ['2023-07-14', 413],
        ['2023-10-30', 368],
        ['2023-11-03', 322],
        ['2023-11-26', 278]] 
  
#define column names
columns = ['date', 'sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----------+-----+
|      date|sales|
+----------+-----+
|2023-01-15|  225|
|2023-02-24|  260|
|2023-07-14|  413|
|2023-10-30|  368|
|2023-11-03|  322|
|2023-11-26|  278|
+----------+-----+

Suppose we would like to add a new column that adds 5 years to each date in the date column.

We can use the following syntax to do so:

from pyspark.sql import functions as F

#add 5 years to each date in 'date' column
df.withColumn('add5years', F.add_months(df['date'], 12*5)).show() 

+----------+-----+----------+
|      date|sales| add5years|
+----------+-----+----------+
|2023-01-15|  225|2028-01-15|
|2023-02-24|  260|2028-02-24|
|2023-07-14|  413|2028-07-14|
|2023-10-30|  368|2028-10-30|
|2023-11-03|  322|2028-11-03|
|2023-11-26|  278|2028-11-26|
+----------+-----+----------+

Notice that the new add5years column contains each of the dates from the date column with five years added.

Note that if you would instead like to subtract 5 years, you could multiply 5 by -12 in the add_months() function instead:

from pyspark.sql import functions as F

#subtract 5 years from each date in 'date' column
df.withColumn('sub5years', F.add_months(df['date'], -12*5)).show()

+----------+-----+----------+
|      date|sales| sub5years|
+----------+-----+----------+
|2023-01-15|  225|2018-01-15|
|2023-02-24|  260|2018-02-24|
|2023-07-14|  413|2018-07-14|
|2023-10-30|  368|2018-10-30|
|2023-11-03|  322|2018-11-03|
|2023-11-26|  278|2018-11-26|
+----------+-----+----------+

Notice that the new sub5years column contains each of the dates from the date column with five years subtracted.

Note that we used the withColumn function to return a new DataFrame with the sub5years column added and all other columns left the same.

You can find the complete documentation for the PySpark withColumn function .

x