How can I add years to a date column in PySpark?

Adding years to a date column in PySpark can be achieved by using the “withColumn” function and the “date_add” function. The “withColumn” function allows for the creation of a new column while the “date_add” function allows for the addition of a specified number of years to a date. This process involves specifying the date column, the name of the new column, and the number of years to be added. This method is useful for data manipulation and analysis in PySpark, allowing for the easy calculation of future dates and the transformation of data.

PySpark: Add Years to a Date Column


You can use the following syntax to add a specific number of years to a date column in a PySpark DataFrame:

from pyspark.sql import functions as F

df.withColumn('add5years', F.add_months(df['date'], 12*5)).show()

This particular example creates a new column called add5years that adds 5 years to each date in the date column.

Note: There is no function in PySpark to directly add a specific number of years to a date, so instead we must use the add_months() function and simply multiply the number of years we’d like to add by 12.

The following example shows how to use this syntax in practice.

Example: How to Add Years to a Date Column in PySpark

Suppose we have the following PySpark DataFrame that contains information about sales made on various dates at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['2023-01-15', 225],
        ['2023-02-24', 260],
        ['2023-07-14', 413],
        ['2023-10-30', 368],
        ['2023-11-03', 322],
        ['2023-11-26', 278]] 
  
#define column names
columns = ['date', 'sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----------+-----+
|      date|sales|
+----------+-----+
|2023-01-15|  225|
|2023-02-24|  260|
|2023-07-14|  413|
|2023-10-30|  368|
|2023-11-03|  322|
|2023-11-26|  278|
+----------+-----+

Suppose we would like to add a new column that adds 5 years to each date in the date column.

We can use the following syntax to do so:

from pyspark.sql import functions as F

#add 5 years to each date in 'date' column
df.withColumn('add5years', F.add_months(df['date'], 12*5)).show() 

+----------+-----+----------+
|      date|sales| add5years|
+----------+-----+----------+
|2023-01-15|  225|2028-01-15|
|2023-02-24|  260|2028-02-24|
|2023-07-14|  413|2028-07-14|
|2023-10-30|  368|2028-10-30|
|2023-11-03|  322|2028-11-03|
|2023-11-26|  278|2028-11-26|
+----------+-----+----------+

Notice that the new add5years column contains each of the dates from the date column with five years added.

Note that if you would instead like to subtract 5 years, you could multiply 5 by -12 in the add_months() function instead:

from pyspark.sql import functions as F

#subtract 5 years from each date in 'date' column
df.withColumn('sub5years', F.add_months(df['date'], -12*5)).show()

+----------+-----+----------+
|      date|sales| sub5years|
+----------+-----+----------+
|2023-01-15|  225|2018-01-15|
|2023-02-24|  260|2018-02-24|
|2023-07-14|  413|2018-07-14|
|2023-10-30|  368|2018-10-30|
|2023-11-03|  322|2018-11-03|
|2023-11-26|  278|2018-11-26|
+----------+-----+----------+

Notice that the new sub5years column contains each of the dates from the date column with five years subtracted.

Note that we used the withColumn function to return a new DataFrame with the sub5years column added and all other columns left the same.

You can find the complete documentation for the PySpark withColumn function .

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

x