How can I use PySpark to create a date column from year, month, and day columns?

PySpark is an open-source framework for processing large datasets with Python. Among its data-transformation features is the ability to create new columns, including building a single column of date type from separate year, month, and day columns. The most direct way to do this is with the built-in make_date function, which combines the three integer columns into one date column. The sections below walk through this process step by step.
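Conceptually, make_date does per row what Python's standard datetime.date constructor does: it assembles a calendar date from three integers. A minimal plain-Python sketch of the same idea:

```python
from datetime import date

# Assemble a date from separate year, month, and day integers,
# mirroring what make_date does for each DataFrame row
rows = [(2021, 10, 30), (2022, 1, 14)]
dates = [date(y, m, d) for y, m, d in rows]
print(dates)  # [datetime.date(2021, 10, 30), datetime.date(2022, 1, 14)]
```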

PySpark: Create Date Column from Year, Month and Day


You can use the following syntax to create a date column from year, month and day columns in a PySpark DataFrame:

from pyspark.sql import functions as F

df_new = df.withColumn('date', F.make_date('year', 'month', 'day'))

This particular example creates a new column called date by using the values in the year, month and day columns.

The following example shows how to use this syntax in practice.

Example: Create Date Column from Year, Month and Day

Suppose we have the following PySpark DataFrame that contains three columns to represent the year, month and day of a given date:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[2021, 10, 30], 
        [2021, 12, 3], 
        [2022, 1, 14], 
        [2022, 3, 22], 
        [2022, 5, 24], 
        [2023, 3, 21],
        [2023, 7, 18],
        [2023, 11, 4]] 
  
#define column names
columns = ['year', 'month', 'day'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+-----+---+
|year|month|day|
+----+-----+---+
|2021|   10| 30|
|2021|   12|  3|
|2022|    1| 14|
|2022|    3| 22|
|2022|    5| 24|
|2023|    3| 21|
|2023|    7| 18|
|2023|   11|  4|
+----+-----+---+

We can use the following syntax to create a new column called date that creates a date from the existing values in the year, month and day columns:

from pyspark.sql import functions as F

#create new DataFrame with 'date' column
df_new = df.withColumn('date', F.make_date('year', 'month', 'day'))

#view new DataFrame
df_new.show()

+----+-----+---+----------+
|year|month|day|      date|
+----+-----+---+----------+
|2021|   10| 30|2021-10-30|
|2021|   12|  3|2021-12-03|
|2022|    1| 14|2022-01-14|
|2022|    3| 22|2022-03-22|
|2022|    5| 24|2022-05-24|
|2023|    3| 21|2023-03-21|
|2023|    7| 18|2023-07-18|
|2023|   11|  4|2023-11-04|
+----+-----+---+----------+

Notice that the new DataFrame contains a date column with dates created from the existing year, month and day columns.

We can use the following syntax to verify that the data type of the new date column is indeed a date:

#check data type of new 'date' column
dict(df_new.dtypes)['date']

'date'

The new column does indeed have a data type of date.

Note that we used the withColumn function to return a new DataFrame with a new column added and all other existing columns left unchanged.

You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.
