How can I calculate the difference between two dates in PySpark?

You can calculate the difference between two dates in PySpark with the built-in functions in pyspark.sql.functions. First, convert the columns to a proper date type using the to_date function. Once the dates are in date format, the datediff function takes the end date and the start date and returns the number of days between them, while the months_between function returns the difference in months. Together, these functions provide an efficient and accurate way to calculate date differences in PySpark.

PySpark: Calculate a Difference Between Two Dates


You can use the following methods to calculate a difference between two dates in PySpark:

Method 1: Calculate Difference Between Dates in Days

from pyspark.sql import functions as F

df.withColumn('diff_days', F.datediff(F.to_date('end_date'), F.to_date('start_date'))).show()
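For intuition, datediff counts whole calendar days from the start date to the end date, which matches subtracting two plain Python date objects. A quick sketch outside Spark, using the first employee's dates from the example below:

```python
from datetime import date

# datediff(end, start) counts calendar days between the two dates;
# plain-Python equivalent using date subtraction
start = date(2020, 10, 25)
end = date(2023, 1, 15)

diff_days = (end - start).days
print(diff_days)  # 812
```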

Method 2: Calculate Difference Between Dates in Months

from pyspark.sql import functions as F

df.withColumn('diff_months', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')), 2)).show()

Method 3: Calculate Difference Between Dates in Years

from pyspark.sql import functions as F

df.withColumn('diff_years', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date'))/12, 2)).show()
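Per the Spark documentation, months_between returns the whole-month difference plus a fractional part computed on a 31-day-month convention, and dividing by 12 gives the difference in years. A simplified plain-Python sketch of that rule (ignoring time-of-day and the end-of-month special cases Spark also handles):

```python
from datetime import date

def months_between(end: date, start: date) -> float:
    """Simplified sketch of Spark's months_between for plain dates:
    whole months from the year/month parts, plus a fractional part
    based on a 31-day month (end-of-month edge cases omitted)."""
    whole = (end.year - start.year) * 12 + (end.month - start.month)
    return whole + (end.day - start.day) / 31.0

# first employee's dates from the example below
mb = months_between(date(2023, 1, 15), date(2020, 10, 25))
print(round(mb, 2))       # 26.68 (months)
print(round(mb / 12, 2))  # 2.22  (years)
```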

Each of these examples calculates a date difference using the ending dates from the end_date column and the starting dates from the start_date column in the DataFrame.

The following example shows how to use each method in practice.

Example: Calculate Difference Between Dates in PySpark

Suppose we have the following PySpark DataFrame that contains information about the start date and end date for various employees at a company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '2020-10-25', '2023-01-15'],
        ['B', '2013-10-11', '2029-01-18'],
        ['C', '2015-10-17', '2022-04-15'],
        ['D', '2022-12-21', '2023-04-23'],
        ['E', '2021-04-14', '2023-07-25'],
        ['F', '2021-06-26', '2021-07-12']] 
  
#define column names
columns = ['employee', 'start_date', 'end_date'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+--------+----------+----------+
|employee|start_date|  end_date|
+--------+----------+----------+
|       A|2020-10-25|2023-01-15|
|       B|2013-10-11|2029-01-18|
|       C|2015-10-17|2022-04-15|
|       D|2022-12-21|2023-04-23|
|       E|2021-04-14|2023-07-25|
|       F|2021-06-26|2021-07-12|
+--------+----------+----------+

We can use the following syntax to calculate the date difference between each start and end date in terms of days, months and years:

from pyspark.sql import functions as F

#create new DataFrame with date difference columns
df.withColumn('diff_days', F.datediff(F.to_date('end_date'), F.to_date('start_date'))) \
  .withColumn('diff_months', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')), 2)) \
  .withColumn('diff_years', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date'))/12, 2)).show()

+--------+----------+----------+---------+-----------+----------+
|employee|start_date|  end_date|diff_days|diff_months|diff_years|
+--------+----------+----------+---------+-----------+----------+
|       A|2020-10-25|2023-01-15|      812|      26.68|      2.22|
|       B|2013-10-11|2029-01-18|     5578|     183.23|     15.27|
|       C|2015-10-17|2022-04-15|     2372|      77.94|      6.49|
|       D|2022-12-21|2023-04-23|      123|       4.06|      0.34|
|       E|2021-04-14|2023-07-25|      832|      27.35|      2.28|
|       F|2021-06-26|2021-07-12|       16|       0.55|      0.05|
+--------+----------+----------+---------+-----------+----------+
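As a sanity check outside Spark, the diff_days column above can be reproduced with plain Python date arithmetic (a quick verification sketch, not part of the Spark pipeline):

```python
from datetime import date

# same start/end dates as the DataFrame above
rows = [('A', date(2020, 10, 25), date(2023, 1, 15)),
        ('B', date(2013, 10, 11), date(2029, 1, 18)),
        ('C', date(2015, 10, 17), date(2022, 4, 15)),
        ('D', date(2022, 12, 21), date(2023, 4, 23)),
        ('E', date(2021, 4, 14), date(2023, 7, 25)),
        ('F', date(2021, 6, 26), date(2021, 7, 12))]

# subtracting two dates yields a timedelta whose .days matches datediff
diff_days = {emp: (end - start).days for emp, start, end in rows}
print(diff_days)  # {'A': 812, 'B': 5578, 'C': 2372, 'D': 123, 'E': 832, 'F': 16}
```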

Notice that three new columns were added to the DataFrame that show the date difference between each start and end date in terms of days, months and years.

For example:

  • There are 812 days between 2020-10-25 and 2023-01-15.
  • There are 26.68 months between 2020-10-25 and 2023-01-15.
  • There are 2.22 years between 2020-10-25 and 2023-01-15.

Note that we use the round function to round the month and year differences to two decimal places, but you can round to however many decimal places you'd like.

Also note that we used the withColumn function several times to return a new DataFrame with the new columns added and all existing columns left unchanged.

You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.
