How can I convert a string to a timestamp in PySpark, and can you provide an example?

To convert a string to a timestamp in PySpark, use the to_timestamp function from pyspark.sql.functions, which parses a string column according to a specified datetime pattern and returns a column of timestamp type. For example, with the pattern 'yyyy-MM-dd HH:mm:ss', the string '2020-10-10 12:00:00' is converted to the timestamp 2020-10-10 12:00:00. Once the column is a timestamp, it can be sorted, filtered, and passed to date and time functions as a point in time rather than as text.

Convert String to Timestamp in PySpark (With Example)

You can use the following syntax to convert a string column to a timestamp column in a PySpark DataFrame:

from pyspark.sql import functions as F

df = df.withColumn('ts_new', F.to_timestamp('ts', 'yyyy-MM-dd HH:mm:ss'))

This particular example creates a new column called ts_new that contains timestamp values from the string values in the ts column.

The following example shows how to use this syntax in practice.

Example: How to Convert String to Timestamp in PySpark

Suppose we have the following PySpark DataFrame that contains information about sales made at various times at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['2023-01-15 04:14:22', 225],
        ['2023-02-24 10:55:01', 260],
        ['2023-07-14 18:34:59', 413],
        ['2023-10-30 22:20:05', 368]]  
#define column names
columns = ['ts', 'sales'] 
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
#view dataframe
df.show()

+-------------------+-----+
|                 ts|sales|
+-------------------+-----+
|2023-01-15 04:14:22|  225|
|2023-02-24 10:55:01|  260|
|2023-07-14 18:34:59|  413|
|2023-10-30 22:20:05|  368|
+-------------------+-----+

We can use the following syntax to display the data type of each column in the DataFrame:

#check data type of each column
df.dtypes

[('ts', 'string'), ('sales', 'bigint')]

We can see that the ts column currently has a data type of string.

To convert this column from a string to a timestamp, we can use the following syntax:

from pyspark.sql import functions as F

#convert 'ts' column from string to timestamp
df = df.withColumn('ts_new', F.to_timestamp('ts', 'yyyy-MM-dd HH:mm:ss'))

#view updated DataFrame
df.show()

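+-------------------+-----+-------------------+
|                 ts|sales|             ts_new|
+-------------------+-----+-------------------+
|2023-01-15 04:14:22|  225|2023-01-15 04:14:22|
|2023-02-24 10:55:01|  260|2023-02-24 10:55:01|
|2023-07-14 18:34:59|  413|2023-07-14 18:34:59|
|2023-10-30 22:20:05|  368|2023-10-30 22:20:05|
+-------------------+-----+-------------------+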

We can use the dtypes attribute once again to view the data type of each column in the DataFrame:

#check data type of each column
df.dtypes

[('ts', 'string'), ('sales', 'bigint'), ('ts_new', 'timestamp')]

We can see that the new column called ts_new has a data type of timestamp.

We have successfully converted a string column to a timestamp column.

