You can use the following syntax to add a specific number of days to a date column in a PySpark DataFrame:
from pyspark.sql import functions as F
df.withColumn('date_plus_5', F.date_add(df['date'], 5)).show()
This particular example creates a new column called date_plus_5 that adds 5 days to each date in the date column.
The following example shows how to use this syntax in practice.
Example: How to Add Days to a Date Column in PySpark
Suppose we have the following PySpark DataFrame that contains information about sales made on various dates at some company:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['2023-01-15', 225],
['2023-02-24', 260],
['2023-07-14', 413],
['2023-10-30', 368],
['2023-11-03', 322],
['2023-11-26', 278]]
#define column names
columns = ['date', 'sales']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----------+-----+
|      date|sales|
+----------+-----+
|2023-01-15|  225|
|2023-02-24|  260|
|2023-07-14|  413|
|2023-10-30|  368|
|2023-11-03|  322|
|2023-11-26|  278|
+----------+-----+
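Note that the values in the date column are stored as strings. The date_add() function implicitly casts strings in yyyy-MM-dd format to dates, but if you prefer to work with an actual DateType column you can convert it explicitly with to_date(). Here is a minimal sketch:
from pyspark.sql import functions as F
#cast the 'date' column from string to DateType and inspect the schema
df.withColumn('date', F.to_date('date', 'yyyy-MM-dd')).printSchema()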
Suppose we would like to add a new column that adds 5 days to each date in the date column.
We can use the following syntax to do so:
from pyspark.sql import functions as F
#add 5 days to each date in 'date' column
df.withColumn('date_plus_5', F.date_add(df['date'], 5)).show()
+----------+-----+-----------+
|      date|sales|date_plus_5|
+----------+-----+-----------+
|2023-01-15|  225| 2023-01-20|
|2023-02-24|  260| 2023-03-01|
|2023-07-14|  413| 2023-07-19|
|2023-10-30|  368| 2023-11-04|
|2023-11-03|  322| 2023-11-08|
|2023-11-26|  278| 2023-12-01|
+----------+-----+-----------+
Notice that the new date_plus_5 column contains each of the dates from the date column with five days added. Also notice that date_add() correctly rolls over month boundaries; for example, 2023-02-24 plus five days becomes 2023-03-01.
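If the number of days to add varies by row, you can pass a column reference instead of a literal by using a SQL expression. The following sketch assumes a hypothetical integer column named days_to_add, created here with lit() just for illustration:
from pyspark.sql import functions as F
#hypothetical 'days_to_add' column, created only for illustration
df_with_days = df.withColumn('days_to_add', F.lit(5))
#add a per-row number of days using a SQL expression
df_with_days.withColumn('new_date', F.expr('date_add(date, days_to_add)')).show()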
Note that if you would like to subtract 5 days instead, you can use the date_sub() function:
from pyspark.sql import functions as F
#subtract 5 days from each date in 'date' column
df.withColumn('date_sub_5', F.date_sub(df['date'], 5)).show()
+----------+-----+----------+
|      date|sales|date_sub_5|
+----------+-----+----------+
|2023-01-15|  225|2023-01-10|
|2023-02-24|  260|2023-02-19|
|2023-07-14|  413|2023-07-09|
|2023-10-30|  368|2023-10-25|
|2023-11-03|  322|2023-10-29|
|2023-11-26|  278|2023-11-21|
+----------+-----+----------+
Notice that the new date_sub_5 column contains each of the dates from the date column with five days subtracted.
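Equivalently, you can pass a negative number of days to date_add() to subtract days; a quick sketch:
from pyspark.sql import functions as F
#subtract 5 days by passing a negative value to date_add()
df.withColumn('date_sub_5', F.date_add(df['date'], -5)).show()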
Note that we used the withColumn function to return a new DataFrame with the date_sub_5 column added and all other columns left unchanged.
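Also note that withColumn does not modify the original DataFrame, so assign the result to a variable if you want to keep the new column for later steps. A minimal sketch:
from pyspark.sql import functions as F
#save the result to a new DataFrame so the added column persists
df_new = df.withColumn('date_sub_5', F.date_sub(df['date'], 5))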
You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.