How can I convert a column from a date format to a string format in PySpark?

To convert a column from a date format to a string format in PySpark, you can use the date_format function from pyspark.sql.functions. This function returns each date as a string in the pattern you specify, such as MM/dd/yyyy. Once the column is a string, it can be filtered and transformed with the usual string functions. Keep in mind that strings sort alphabetically, so the sorted order matches chronological order only for patterns such as yyyy-MM-dd that put the year first. To turn other data types into strings, such as integers and floats, cast the column to string instead, since date_format applies only to date and timestamp values.

PySpark: Convert Column from Date to String


You can use the following syntax to convert a column from a date to a string in PySpark:

from pyspark.sql.functions import date_format

df_new = df.withColumn('date_string', date_format('date', 'MM/dd/yyyy'))

This particular example converts the dates in the date column to strings in a new column called date_string, using MM/dd/yyyy as the date format.

The following example shows how to use this syntax in practice.

Example: How to Convert Column from Date to String in PySpark

Suppose we have the following PySpark DataFrame that contains information about sales made on various days for some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
import datetime

#define data
data = [[datetime.date(2023, 10, 30), 136], 
        [datetime.date(2023, 11, 14), 223], 
        [datetime.date(2023, 11, 22), 450], 
        [datetime.date(2023, 11, 25), 290], 
        [datetime.date(2023, 12, 19), 189]]
  
#define column names
columns = ['date', 'sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----------+-----+
|      date|sales|
+----------+-----+
|2023-10-30|  136|
|2023-11-14|  223|
|2023-11-22|  450|
|2023-11-25|  290|
|2023-12-19|  189|
+----------+-----+

We can use the dtypes attribute to check the data type of each column in the DataFrame:

#check data type of each column
df.dtypes

[('date', 'date'), ('sales', 'bigint')]

We can see that the date column currently has a data type of date.

To convert this column from a date to a string, we can use the following syntax:

from pyspark.sql.functions import date_format

#create new column that converts dates to strings
df_new = df.withColumn('date_string', date_format('date', 'MM/dd/yyyy'))

#view new DataFrame
df_new.show()

+----------+-----+-----------+
|      date|sales|date_string|
+----------+-----+-----------+
|2023-10-30|  136| 10/30/2023|
|2023-11-14|  223| 11/14/2023|
|2023-11-22|  450| 11/22/2023|
|2023-11-25|  290| 11/25/2023|
|2023-12-19|  189| 12/19/2023|
+----------+-----+-----------+

We can use the dtypes attribute once again to view the data type of each column in the new DataFrame:

#check data type of each column
df_new.dtypes

[('date', 'date'), ('sales', 'bigint'), ('date_string', 'string')]

We can see that the date_string column has a data type of string.

We have successfully created a string column from a date column.

Note: We used MM/dd/yyyy as the date format within the date_format function but feel free to use whatever date format you’d like.
