How do I convert a timestamp to a date in PySpark, and can you provide an example?

In PySpark, a timestamp can be converted to a date with the to_date function from pyspark.sql.functions. This function returns a date column in the format yyyy-MM-dd, dropping the time portion. An example of this conversion would be:

```
from pyspark.sql.functions import to_date

df = spark.createDataFrame([('2021-01-01 12:00:00',)], ['timestamp'])
df.select(to_date(df.timestamp).alias('date')).show()

+----------+
|      date|
+----------+
|2021-01-01|
+----------+
```

Once a timestamp has been converted, the resulting date column can be used throughout PySpark to filter, group, and otherwise analyze data based on dates.

Convert Timestamp to Date in PySpark (With Example)


You can use the following syntax to convert a timestamp column to a date column in a PySpark DataFrame:

from pyspark.sql.types import DateType

df = df.withColumn('my_date', df['my_timestamp'].cast(DateType()))

This particular example creates a new column called my_date that contains the date values from the timestamp values in the my_timestamp column.

The following example shows how to use this syntax in practice.

Example: How to Convert Timestamp to Date in PySpark

Suppose we have the following PySpark DataFrame that contains information about sales made on various timestamps at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql import functions as F

#define data
data = [['2023-01-15 04:14:22', 225],
        ['2023-02-24 10:55:01', 260],
        ['2023-07-14 18:34:59', 413],
        ['2023-10-30 22:20:05', 368]] 
  
#define column names
columns = ['ts', 'sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#convert string column to timestamp
df = df.withColumn('ts', F.to_timestamp('ts', 'yyyy-MM-dd HH:mm:ss'))
  
#view dataframe
df.show()

+-------------------+-----+
|                 ts|sales|
+-------------------+-----+
|2023-01-15 04:14:22|  225|
|2023-02-24 10:55:01|  260|
|2023-07-14 18:34:59|  413|
|2023-10-30 22:20:05|  368|
+-------------------+-----+

We can use the following syntax to display the data type of each column in the DataFrame:

#check data type of each column
df.dtypes

[('ts', 'timestamp'), ('sales', 'bigint')]

We can see that the ts column currently has a data type of timestamp.

To convert this column from a timestamp to a date, we can use the following syntax:

from pyspark.sql.types import DateType

#create date column from timestamp column
df = df.withColumn('new_date', df['ts'].cast(DateType()))

#view updated DataFrame
df.show()

+-------------------+-----+----------+
|                 ts|sales|  new_date|
+-------------------+-----+----------+
|2023-01-15 04:14:22|  225|2023-01-15|
|2023-02-24 10:55:01|  260|2023-02-24|
|2023-07-14 18:34:59|  413|2023-07-14|
|2023-10-30 22:20:05|  368|2023-10-30|
+-------------------+-----+----------+

We can use the dtypes attribute once again to view the data type of each column in the DataFrame:

#check data type of each column
df.dtypes

[('ts', 'timestamp'), ('sales', 'bigint'), ('new_date', 'date')]

We can see that the new_date column has a data type of date.

We have successfully created a date column from a timestamp column.
