How can I extract the year from a date using PySpark?

PySpark provides a built-in year function for extracting the year from a date. If your dates are stored as strings, you can first convert them to a PySpark DateType column (for example with to_date) and then apply year, which returns the year as an integer that you can use in further analysis or processing.
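For example, a minimal sketch of this two-step approach (using a hypothetical single-column DataFrame with string dates) might look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, year

spark = SparkSession.builder.getOrCreate()

#hypothetical example data: dates stored as strings
df = spark.createDataFrame([('2023-07-04',), ('2024-01-15',)], ['date'])

#convert the string column to DateType, then extract the year as an integer
df = df.withColumn('date', to_date('date', 'yyyy-MM-dd'))
df = df.withColumn('year', year('date'))

df.show()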

PySpark: Extract Year from Date


You can use the following syntax to extract the year from a date in a PySpark DataFrame:

from pyspark.sql.functions import year

df_new = df.withColumn('year', year(df['date']))

This particular example creates a new column called year that contains the year extracted from each date in the date column.
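If you prefer SQL-style syntax, the same result can be obtained with a Spark SQL expression (a sketch assuming the same df and date column):

#equivalent approach using a SQL expression
df_new = df.selectExpr('*', 'year(date) AS year')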

The following example shows how to use this syntax in practice.

Example: How to Extract Year from Date in PySpark

Suppose we have the following PySpark DataFrame that contains information about the sales made on various days at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['2021-04-11', 22],
        ['2021-04-15', 14],
        ['2021-04-17', 12],
        ['2022-05-21', 15],
        ['2022-05-23', 30],
        ['2023-10-26', 45],
        ['2023-10-28', 32],
        ['2023-10-29', 47]]
  
#define column names
columns = ['date', 'sales']
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----------+-----+
|      date|sales|
+----------+-----+
|2021-04-11|   22|
|2021-04-15|   14|
|2021-04-17|   12|
|2022-05-21|   15|
|2022-05-23|   30|
|2023-10-26|   45|
|2023-10-28|   32|
|2023-10-29|   47|
+----------+-----+

Suppose we would like to extract the year from each date in the date column.

We can use the following syntax to do so:

from pyspark.sql.functions import year

#extract year from date column
df_new = df.withColumn('year', year(df['date']))

#view new DataFrame
df_new.show()

+----------+-----+----+
|      date|sales|year|
+----------+-----+----+
|2021-04-11|   22|2021|
|2021-04-15|   14|2021|
|2021-04-17|   12|2021|
|2022-05-21|   15|2022|
|2022-05-23|   30|2022|
|2023-10-26|   45|2023|
|2023-10-28|   32|2023|
|2023-10-29|   47|2023|
+----------+-----+----+

The new year column contains the year of each date in the date column.

Note that we used the withColumn function to add a new column called year to the DataFrame while keeping all existing columns the same.
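If you instead only need the extracted year rather than all existing columns, a sketch using select (assuming the same df) would look like this:

from pyspark.sql.functions import year

#keep only the extracted year and the sales column
df_years = df.select(year('date').alias('year'), 'sales')

df_years.show()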

Note: You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.
