How to Calculate the Difference Between Two Times in PySpark?

In PySpark, the difference between two times can be calculated by casting both timestamp columns to long (which yields Unix seconds) and subtracting one from the other. The result is a number of seconds, which can then be divided to express the difference in minutes or hours. Additionally, the F.datediff() function can be used to return the difference between two dates in days.
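For example, here is a minimal sketch of F.datediff(), assuming a DataFrame df that already contains timestamp columns named start_time and end_time:

from pyspark.sql import functions as F

#datediff() counts the number of calendar days between two dates/timestamps,
#not the number of full 24-hour periods
df = df.withColumn('days_diff', F.datediff('end_time', 'start_time'))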

You can use the following syntax to calculate a difference between two times in a PySpark DataFrame:

from pyspark.sql.functions import col

df_new = df.withColumn('seconds_diff', col('end_time').cast('long') - col('start_time').cast('long')) \
           .withColumn('minutes_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/60) \
           .withColumn('hours_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/3600)

This particular example calculates the difference between the times in the start_time and end_time columns of a DataFrame in seconds, minutes, and hours.
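Alternatively, if you are using Spark 3.3 or later, the SQL timestampdiff() function can compute the difference directly in a given unit without manual casting. This is a sketch assuming the same start_time and end_time timestamp columns; note that timestampdiff() truncates to whole units rather than returning fractional values:

from pyspark.sql import functions as F

#timestampdiff(unit, start, end) is a Spark SQL function (Spark 3.3+),
#so it is invoked here through F.expr()
df_alt = df.withColumn('seconds_diff', F.expr('timestampdiff(SECOND, start_time, end_time)')) \
           .withColumn('minutes_diff', F.expr('timestampdiff(MINUTE, start_time, end_time)')) \
           .withColumn('hours_diff', F.expr('timestampdiff(HOUR, start_time, end_time)'))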

The following example shows how to use this syntax in practice.

Example: How to Calculate Difference Between Two Times in PySpark

Suppose we have the following PySpark DataFrame that contains a column of start times for some activity and a column of end times:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql import functions as F

#define data
data = [['2023-01-15 04:14:22', '2023-01-18 04:15:00'],
        ['2023-02-24 10:55:01', '2023-02-24 11:14:30'],
        ['2023-07-14 18:34:59', '2023-07-14 18:35:22'],
        ['2023-10-30 22:20:05', '2023-11-02 07:55:00']] 
  
#define column names
columns = ['start_time', 'end_time'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#convert string columns to timestamp
df = df.withColumn('start_time', F.to_timestamp('start_time', 'yyyy-MM-dd HH:mm:ss')) \
       .withColumn('end_time', F.to_timestamp('end_time', 'yyyy-MM-dd HH:mm:ss'))
         
#view DataFrame
df.show()

+-------------------+-------------------+
|         start_time|           end_time|
+-------------------+-------------------+
|2023-01-15 04:14:22|2023-01-18 04:15:00|
|2023-02-24 10:55:01|2023-02-24 11:14:30|
|2023-07-14 18:34:59|2023-07-14 18:35:22|
|2023-10-30 22:20:05|2023-11-02 07:55:00|
+-------------------+-------------------+
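We can confirm that both columns were actually converted from strings to timestamps by checking the schema:

df.printSchema()

root
 |-- start_time: timestamp (nullable = true)
 |-- end_time: timestamp (nullable = true)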

We can use the following syntax to create a new DataFrame that contains three new columns that display the time difference between each start and end time in terms of seconds, minutes and hours:

from pyspark.sql.functions import col
#create new DataFrame with time differences
df_new = df.withColumn('seconds_diff', col('end_time').cast('long') - col('start_time').cast('long')) \
           .withColumn('minutes_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/60) \
           .withColumn('hours_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/3600)

#view new DataFrame
df_new.show()

+-------------------+-------------------+------------+-------------------+--------------------+
|         start_time|           end_time|seconds_diff|       minutes_diff|          hours_diff|
+-------------------+-------------------+------------+-------------------+--------------------+
|2023-01-15 04:14:22|2023-01-18 04:15:00|      259238|  4320.633333333333|   72.01055555555556|
|2023-02-24 10:55:01|2023-02-24 11:14:30|        1169| 19.483333333333334| 0.32472222222222225|
|2023-07-14 18:34:59|2023-07-14 18:35:22|          23|0.38333333333333336|0.006388888888888889|
|2023-10-30 22:20:05|2023-11-02 07:55:00|      207295| 3454.9166666666665|  57.581944444444446|
+-------------------+-------------------+------------+-------------------+--------------------+

The resulting DataFrame contains the following three new columns:

  • seconds_diff: The difference between each start and end time in seconds.
  • minutes_diff: The difference between each start and end time in minutes.
  • hours_diff: The difference between each start and end time in hours.
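The fractional values can be hard to read at a glance; as a minimal sketch, the F.round() function can trim the minute and hour differences to two decimal places:

from pyspark.sql import functions as F

#round the fractional differences to two decimal places for readability
df_rounded = df_new.withColumn('minutes_diff', F.round('minutes_diff', 2)) \
                   .withColumn('hours_diff', F.round('hours_diff', 2))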

Note that we chained the withColumn function three times to return a new DataFrame with three columns added to the existing one.
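If you prefer not to repeat the cast in each withColumn call, the seconds difference can be computed once and reused; here is a minimal sketch that produces the same result with a single select:

from pyspark.sql.functions import col

#compute the difference in seconds once, then derive minutes and hours from it
diff = col('end_time').cast('long') - col('start_time').cast('long')

df_new = df.select(
    'start_time',
    'end_time',
    diff.alias('seconds_diff'),
    (diff / 60).alias('minutes_diff'),
    (diff / 3600).alias('hours_diff')
)

Building all three columns in a single select also keeps the query plan flatter than chaining several withColumn calls.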

You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.
