How can I create a Date Column from Year, Month and Day using PySpark?

To create a date column from year, month, and day columns in PySpark, you can use the make_date function. It takes three arguments (the year, month, and day columns) and combines them into a single column of type date. The new column can then be used for analysis or for further transformations such as filtering, sorting, or date arithmetic.


You can use the following syntax to create a date column from year, month and day columns in a PySpark DataFrame:

from pyspark.sql import functions as F

df_new = df.withColumn('date', F.make_date('year', 'month', 'day'))

This particular example creates a new column called date by using the values in the year, month and day columns.

The following example shows how to use the make_date function in practice.

Example: Create Date Column from Year, Month and Day

Suppose we have the following PySpark DataFrame that contains three columns to represent the year, month and day of a given date:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[2021, 10, 30], 
        [2021, 12, 3], 
        [2022, 1, 14], 
        [2022, 3, 22], 
        [2022, 5, 24], 
        [2023, 3, 21],
        [2023, 7, 18],
        [2023, 11, 4]] 
  
#define column names
columns = ['year', 'month', 'day'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+-----+---+
|year|month|day|
+----+-----+---+
|2021|   10| 30|
|2021|   12|  3|
|2022|    1| 14|
|2022|    3| 22|
|2022|    5| 24|
|2023|    3| 21|
|2023|    7| 18|
|2023|   11|  4|
+----+-----+---+

We can use the following syntax to add a new column called date, built from the existing values in the year, month and day columns:

from pyspark.sql import functions as F

#create new DataFrame with 'date' column
df_new = df.withColumn('date', F.make_date('year', 'month', 'day'))

#view new DataFrame
df_new.show()

+----+-----+---+----------+
|year|month|day|      date|
+----+-----+---+----------+
|2021|   10| 30|2021-10-30|
|2021|   12|  3|2021-12-03|
|2022|    1| 14|2022-01-14|
|2022|    3| 22|2022-03-22|
|2022|    5| 24|2022-05-24|
|2023|    3| 21|2023-03-21|
|2023|    7| 18|2023-07-18|
|2023|   11|  4|2023-11-04|
+----+-----+---+----------+

Notice that the new DataFrame contains a date column with dates created from the existing year, month and day columns.

We can use the following syntax to verify that the data type of the new date column is indeed a date:

#check data type of new 'date' column
dict(df_new.dtypes)['date']

'date'

The new column does indeed have a data type of date.

Note that we used the withColumn method, which returns a new DataFrame with the new column added and all existing columns left unchanged.

You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.
