Convert Timestamp to Date in PySpark (With Example)


You can use the following syntax to convert a timestamp column to a date column in a PySpark DataFrame:

from pyspark.sql.types import DateType

df = df.withColumn('my_date', df['my_timestamp'].cast(DateType()))

This particular example creates a new column called my_date that contains only the date portion of each timestamp in the my_timestamp column. The time of day is dropped.
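
To make the effect of the conversion concrete, here is a minimal plain-Python sketch (standard library only, not PySpark) of what happens to a single value. The helper name timestamp_to_date is hypothetical and exists only for illustration, and the sketch ignores time zones.

```python
from datetime import date, datetime

def timestamp_to_date(ts: datetime) -> date:
    """Drop the time-of-day part of a timestamp, keeping only the calendar date."""
    return ts.date()

# a single timestamp value, similar to one cell of a timestamp column
ts = datetime(2023, 1, 15, 4, 14, 22)

# the resulting date no longer carries hours, minutes, or seconds
d = timestamp_to_date(ts)
print(d)  # 2023-01-15
```

Conceptually, the cast in PySpark does the same thing for every row of the column.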

The following example shows how to use this syntax in practice.

Example: How to Convert Timestamp to Date in PySpark

Suppose we have the following PySpark DataFrame that contains information about sales recorded at various timestamps at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql import functions as F

#define data
data = [['2023-01-15 04:14:22', 225],
        ['2023-02-24 10:55:01', 260],
        ['2023-07-14 18:34:59', 413],
        ['2023-10-30 22:20:05', 368]] 
  
#define column names
columns = ['ts', 'sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#convert string column to timestamp
df = df.withColumn('ts', F.to_timestamp('ts', 'yyyy-MM-dd HH:mm:ss'))
  
#view dataframe
df.show()

+-------------------+-----+
|                 ts|sales|
+-------------------+-----+
|2023-01-15 04:14:22|  225|
|2023-02-24 10:55:01|  260|
|2023-07-14 18:34:59|  413|
|2023-10-30 22:20:05|  368|
+-------------------+-----+
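
The to_timestamp call above parses each string using the pattern yyyy-MM-dd HH:mm:ss. As a rough analogy only, the sketch below parses the first row with Python's standard-library strptime, assuming the format %Y-%m-%d %H:%M:%S as a stand-in for the Spark pattern. The name PY_FORMAT is hypothetical and is not part of PySpark.

```python
from datetime import datetime

# Python strptime directives, used here as a stand-in for Spark's 'yyyy-MM-dd HH:mm:ss'
PY_FORMAT = '%Y-%m-%d %H:%M:%S'

# first value from the ts column, still a plain string
raw = '2023-01-15 04:14:22'

# parse the string into a datetime object with separate date and time fields
parsed = datetime.strptime(raw, PY_FORMAT)

print(parsed)  # 2023-01-15 04:14:22
print(parsed.hour, parsed.minute, parsed.second)  # 4 14 22
```

Note that the two pattern languages use different letters for the same fields, so a Spark pattern cannot be passed to strptime unchanged.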

We can use the following syntax to display the data type of each column in the DataFrame:

#check data type of each column
df.dtypes

[('ts', 'timestamp'), ('sales', 'bigint')]

We can see that the ts column currently has a data type of timestamp.

To convert this column from a timestamp to a date, we can use the following syntax:

from pyspark.sql.types import DateType

#create date column from timestamp column
df = df.withColumn('new_date', df['ts'].cast(DateType()))

#view updated DataFrame
df.show()

+-------------------+-----+----------+
|                 ts|sales|  new_date|
+-------------------+-----+----------+
|2023-01-15 04:14:22|  225|2023-01-15|
|2023-02-24 10:55:01|  260|2023-02-24|
|2023-07-14 18:34:59|  413|2023-07-14|
|2023-10-30 22:20:05|  368|2023-10-30|
+-------------------+-----+----------+

We can use the dtypes attribute once again to view the data type of each column in the DataFrame:

#check data type of each column
df.dtypes

[('ts', 'timestamp'), ('sales', 'bigint'), ('new_date', 'date')]

We can see that the new_date column has a data type of date.

We have successfully created a date column from a timestamp column.
