In PySpark, the difference between two times can be calculated by casting both timestamp columns to long (Unix time in seconds) and subtracting. The result is the difference in seconds, which you can then divide to express in minutes or hours. Alternatively, the F.datediff() function returns the difference between two dates or timestamps in whole days.
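If you only need the difference in whole days, F.datediff() can be applied directly. Here is a minimal sketch, assuming a DataFrame df with start_time and end_time timestamp columns:
from pyspark.sql import functions as F
#calculate the difference in whole days between each end and start time
df_days = df.withColumn('days_diff', F.datediff(F.col('end_time'), F.col('start_time')))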
You can use the following syntax to calculate a difference between two times in a PySpark DataFrame:
from pyspark.sql.functions import col
df_new = (df.withColumn('seconds_diff', col('end_time').cast('long') - col('start_time').cast('long'))
            .withColumn('minutes_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/60)
            .withColumn('hours_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/3600))
This particular example calculates the difference between the times in the start_time and end_time columns of a DataFrame in seconds, minutes, and hours.
The following example shows how to use this syntax in practice.
Example: How to Calculate Difference Between Two Times in PySpark
Suppose we have the following PySpark DataFrame that contains a column of start times for some activity and a column of end times:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
#define data
data = [['2023-01-15 04:14:22', '2023-01-18 04:15:00'],
['2023-02-24 10:55:01', '2023-02-24 11:14:30'],
['2023-07-14 18:34:59', '2023-07-14 18:35:22'],
['2023-10-30 22:20:05', '2023-11-02 07:55:00']]
#define column names
columns = ['start_time', 'end_time']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#convert string columns to timestamp
df = (df.withColumn('start_time', F.to_timestamp('start_time', 'yyyy-MM-dd HH:mm:ss'))
        .withColumn('end_time', F.to_timestamp('end_time', 'yyyy-MM-dd HH:mm:ss')))
#view DataFrame
df.show()
+-------------------+-------------------+
| start_time| end_time|
+-------------------+-------------------+
|2023-01-15 04:14:22|2023-01-18 04:15:00|
|2023-02-24 10:55:01|2023-02-24 11:14:30|
|2023-07-14 18:34:59|2023-07-14 18:35:22|
|2023-10-30 22:20:05|2023-11-02 07:55:00|
+-------------------+-------------------+
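Before computing any differences, it is worth confirming that both columns were actually converted from strings to timestamps:
#verify the column types
df.printSchema()
root
 |-- start_time: timestamp (nullable = true)
 |-- end_time: timestamp (nullable = true)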
We can use the following syntax to create a new DataFrame that contains three new columns that display the time difference between each start and end time in terms of seconds, minutes and hours:
from pyspark.sql.functions import col
#create new DataFrame with time differences
df_new = (df.withColumn('seconds_diff', col('end_time').cast('long') - col('start_time').cast('long'))
            .withColumn('minutes_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/60)
            .withColumn('hours_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/3600))
#view new DataFrame
df_new.show()
+-------------------+-------------------+------------+-------------------+--------------------+
| start_time| end_time|seconds_diff| minutes_diff| hours_diff|
+-------------------+-------------------+------------+-------------------+--------------------+
|2023-01-15 04:14:22|2023-01-18 04:15:00| 259238| 4320.633333333333| 72.01055555555556|
|2023-02-24 10:55:01|2023-02-24 11:14:30| 1169| 19.483333333333334| 0.32472222222222225|
|2023-07-14 18:34:59|2023-07-14 18:35:22| 23|0.38333333333333336|0.006388888888888889|
|2023-10-30 22:20:05|2023-11-02 07:55:00| 207295| 3454.9166666666665| 57.581944444444446|
+-------------------+-------------------+------------+-------------------+--------------------+
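The minute and hour columns carry long decimal tails. If you prefer tidier output, one option (not part of the original example) is to round them with F.round:
from pyspark.sql import functions as F
#round the minute and hour differences to two decimal places
df_rounded = (df_new.withColumn('minutes_diff', F.round('minutes_diff', 2))
                    .withColumn('hours_diff', F.round('hours_diff', 2)))
df_rounded.show()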
The resulting DataFrame contains the following three new columns:
- seconds_diff: The difference between each start and end time in seconds.
- minutes_diff: The difference between each start and end time in minutes.
- hours_diff: The difference between each start and end time in hours.
Note that we used the withColumn function three times to return a new DataFrame with three columns added to the existing DataFrame.
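Since the minute and hour values are both derived from the seconds difference, an equivalent approach is to compute the seconds difference once and reuse it. Here is a minimal sketch of this alternative:
from pyspark.sql.functions import col
#compute the seconds difference once, then derive minutes and hours from it
df_new = (df.withColumn('seconds_diff', col('end_time').cast('long') - col('start_time').cast('long'))
            .withColumn('minutes_diff', col('seconds_diff')/60)
            .withColumn('hours_diff', col('seconds_diff')/3600))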
You can find the complete documentation for the PySpark withColumn function in the official PySpark documentation.
The following tutorials explain how to perform other common tasks in PySpark: