Table of Contents

In PySpark, the difference between two times can be calculated by subtracting the timestamps of the two times, or by using the F.date_sub() function to subtract one time from the other. The result will be represented as a timedelta object, and can be formatted to display the difference in days, hours, minutes, and/or seconds. Additionally, the F.datediff() function can be used to return the difference in days between two times.

You can use the following syntax to calculate a difference between two times in a PySpark DataFrame:

from pyspark.sql.functions import col

df_new = df.withColumn('seconds_diff', col('end_time').cast('long') - col('start_time').cast('long'))
           .withColumn('minutes_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/60)
           .withColumn('hours_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/3600)

This particular example calculates the difference between the times in the start_time and end_time columns in a DataFrame in terms of seconds, minutes and hours.

The following example shows how to use this syntax in practice.

Example: How to Calculate Difference Between Two Times in PySpark

Suppose we have the following PySpark DataFrame that contains a column of start times for some activity and a column of end times:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql import functions as F

#define data
data = [['2023-01-15 04:14:22', '2023-01-18 04:15:00'],
        ['2023-02-24 10:55:01', '2023-02-24 11:14:30'],
        ['2023-07-14 18:34:59', '2023-07-14 18:35:22'],
        ['2023-10-30 22:20:05', '2023-11-02 07:55:00']] 
  
#define column names
columns = ['start_time', 'end_time'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#convert string columns to timestamp
df = df.withColumn('start_time', F.to_timestamp('start_time', 'yyyy-MM-dd HH:mm:ss'))
       .withColumn('end_time', F.to_timestamp('end_time', 'yyyy-MM-dd HH:mm:ss'))
         
#view DataFrame
df.show()

+-------------------+-------------------+
|         start_time|           end_time|
+-------------------+-------------------+
|2023-01-15 04:14:22|2023-01-18 04:15:00|
|2023-02-24 10:55:01|2023-02-24 11:14:30|
|2023-07-14 18:34:59|2023-07-14 18:35:22|
|2023-10-30 22:20:05|2023-11-02 07:55:00|
+-------------------+-------------------+

We can use the following syntax to create a new DataFrame that contains three new columns that display the time difference between each start and end time in terms of seconds, minutes and hours:

from pyspark.sql.functions import col

#create new DataFrame with time differences
df_new = df.withColumn('seconds_diff', col('end_time').cast('long') - col('start_time').cast('long'))
           .withColumn('minutes_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/60)
           .withColumn('hours_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/3600)

#view new DataFrame
df_new.show()

+-------------------+-------------------+------------+-------------------+--------------------+
|         start_time|           end_time|seconds_diff|       minutes_diff|          hours_diff|
+-------------------+-------------------+------------+-------------------+--------------------+
|2023-01-15 04:14:22|2023-01-18 04:15:00|      259238|  4320.633333333333|   72.01055555555556|
|2023-02-24 10:55:01|2023-02-24 11:14:30|        1169| 19.483333333333334| 0.32472222222222225|
|2023-07-14 18:34:59|2023-07-14 18:35:22|          23|0.38333333333333336|0.006388888888888889|
|2023-10-30 22:20:05|2023-11-02 07:55:00|      207295| 3454.9166666666665|  57.581944444444446|
+-------------------+-------------------+------------+-------------------+--------------------+

The resulting DataFrame contains the following three new columns:

seconds_diff: The difference between each start and end time in seconds.
minutes_diff: The difference between each start and end time in minutes.
hours_diff: The difference between each start and end time in hours.

Note that we used the withColumn function three times to return a new DataFrame with three columns added to the existing DataFrame.

You can find the complete documentation for the PySpark withColumn function .

The following tutorials explain how to perform other common tasks in PySpark:

How to calculate difference between two times in PySpark?

Example: How to Calculate Difference Between Two Times in PySpark

Requst a

Scale

Example: How to Calculate Difference Between Two Times in PySpark

Related terms:

Requst a

Scale