What is the process for calculating the difference between two dates in PySpark?

The process for calculating the difference between two dates in PySpark involves first converting the dates into PySpark's DateType format, typically with the to_date function. The datediff function can then be used to calculate the difference in days between the two dates; it takes the end date and start date as parameters and returns the difference in days as an integer. Note that datediff only returns days: for a difference in months, PySpark provides the separate months_between function, and a difference in years can be derived by dividing the month difference by 12. Together these functions make date calculations straightforward in PySpark, which is useful for data analysis and manipulation.
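For example, if your date columns are stored as strings, the conversion step is handled by to_date. Here is a minimal sketch of that step (the column name is just illustrative):

from pyspark.sql import functions as F

#parse a string column into DateType; pass a format pattern as the second
#argument if the strings aren't in the default yyyy-MM-dd format
df = df.withColumn('some_date', F.to_date('some_date', 'yyyy-MM-dd'))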


You can use the following methods to calculate a difference between two dates in PySpark:

Method 1: Calculate Difference Between Dates in Days

from pyspark.sql import functions as F

df.withColumn('diff_days', F.datediff(F.to_date('end_date'), F.to_date('start_date'))).show()
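Note that datediff subtracts its second argument from its first, so swapping the two dates returns a negative day count:

#start minus end yields negative values
df.withColumn('diff_days', F.datediff(F.to_date('start_date'), F.to_date('end_date'))).show()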

Method 2: Calculate Difference Between Dates in Months

from pyspark.sql import functions as F

df.withColumn('diff_months', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')),2)).show()
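Unlike datediff, months_between returns a fractional value, and it yields a whole number when both dates fall on the same day of the month. To inspect the raw value, simply drop the round call:

#raw (unrounded) month difference
df.withColumn('diff_months', F.months_between(F.to_date('end_date'), F.to_date('start_date'))).show()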

Method 3: Calculate Difference Between Dates in Years

from pyspark.sql import functions as F

df.withColumn('diff_years', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date'))/12,2)).show()
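If you'd rather report complete years only, one option is to floor the value instead of rounding it:

#floor(months/12) counts only fully completed years
df.withColumn('diff_years', F.floor(F.months_between(F.to_date('end_date'), F.to_date('start_date'))/12)).show()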

Each of these examples calculates a date difference using the ending dates from the end_date column and the starting dates from the start_date column in the DataFrame.

The following example shows how to use each method in practice.

Example: Calculate Difference Between Dates in PySpark

Suppose we have the following PySpark DataFrame that contains information about the start date and end date for various employees at a company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '2020-10-25', '2023-01-15'],
        ['B', '2013-10-11', '2029-01-18'],
        ['C', '2015-10-17', '2022-04-15'],
        ['D', '2022-12-21', '2023-04-23'],
        ['E', '2021-04-14', '2023-07-25'],
        ['F', '2021-06-26', '2021-07-12']] 
  
#define column names
columns = ['employee', 'start_date', 'end_date'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+--------+----------+----------+
|employee|start_date|  end_date|
+--------+----------+----------+
|       A|2020-10-25|2023-01-15|
|       B|2013-10-11|2029-01-18|
|       C|2015-10-17|2022-04-15|
|       D|2022-12-21|2023-04-23|
|       E|2021-04-14|2023-07-25|
|       F|2021-06-26|2021-07-12|
+--------+----------+----------+

We can use the following syntax to calculate the date difference between each start and end date in terms of days, months and years:

from pyspark.sql import functions as F

#create new DataFrame with date difference columns
df.withColumn('diff_days', F.datediff(F.to_date('end_date'), F.to_date('start_date')))\
  .withColumn('diff_months', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')),2))\
  .withColumn('diff_years', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date'))/12,2)).show()

+--------+----------+----------+---------+-----------+----------+
|employee|start_date|  end_date|diff_days|diff_months|diff_years|
+--------+----------+----------+---------+-----------+----------+
|       A|2020-10-25|2023-01-15|      812|      26.68|      2.22|
|       B|2013-10-11|2029-01-18|     5578|     183.23|     15.27|
|       C|2015-10-17|2022-04-15|     2372|      77.94|      6.49|
|       D|2022-12-21|2023-04-23|      123|       4.06|      0.34|
|       E|2021-04-14|2023-07-25|      832|      27.35|      2.28|
|       F|2021-06-26|2021-07-12|       16|       0.55|      0.05|
+--------+----------+----------+---------+-----------+----------+

Notice that three new columns were added to the DataFrame that show the date difference between each start and end date in terms of days, months and years.

For example:

  • There are 812 days between 2020-10-25 and 2023-01-15.
  • There are 26.68 months between 2020-10-25 and 2023-01-15.
  • There are 2.22 years between 2020-10-25 and 2023-01-15.
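As a quick sanity check, you can reproduce the day count with plain Python's datetime module, outside of Spark:

from datetime import date

#subtracting two dates gives a timedelta; its .days attribute is the day count
print((date(2023, 1, 15) - date(2020, 10, 25)).days)  #812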

Note that we used the round function to round the date differences for months and years to two decimal places, but you can round to however many decimal places you’d like.
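For example, passing 0 as the second argument of round returns whole months:

#round the month difference to a whole number
df.withColumn('diff_months', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')), 0)).show()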

Also note that we chained the withColumn function several times to return a new DataFrame with several new columns added and all existing columns left unchanged.
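If you're running PySpark 3.3 or later, the withColumns function (note the plural) can add all three columns in one call; this sketch is equivalent to the example above:

#add all three difference columns in a single call (PySpark 3.3+)
df.withColumns({
    'diff_days': F.datediff(F.to_date('end_date'), F.to_date('start_date')),
    'diff_months': F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')), 2),
    'diff_years': F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date'))/12, 2)
}).show()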

You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.
