How to Find Max Date in PySpark (With Examples)

In PySpark, you can find the max date (i.e. the latest date) in a column by applying the aggregate max() function from pyspark.sql.functions to a date column. The max date returned by this function can then be used in other date operations, such as adding or subtracting days from it (a sketch of this appears after Example 1 below).

You can use the following methods to find the max date (i.e. the latest date) in a column of a PySpark DataFrame:

Method 1: Find Max Date in One Column

from pyspark.sql import functions as F

#find max date in sales_date column
df.select(F.max('sales_date').alias('max_date')).show()

Method 2: Find Max Date in One Column, Grouped by Another Column

from pyspark.sql import functions as F

#find max date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.max('sales_date').alias('max_date')).show()

The following examples show how to use each method in practice with the following PySpark DataFrame that contains information about sales made by various employees at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '2020-10-25', 15],
        ['A', '2013-10-11', 24],
        ['A', '2015-10-17', 31],
        ['B', '2022-12-21', 27],
        ['B', '2021-04-14', 40],
        ['B', '2021-06-26', 34]] 
  
#define column names
columns = ['employee', 'sales_date', 'total_sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+--------+----------+-----------+
|employee|sales_date|total_sales|
+--------+----------+-----------+
|       A|2020-10-25|         15|
|       A|2013-10-11|         24|
|       A|2015-10-17|         31|
|       B|2022-12-21|         27|
|       B|2021-04-14|         40|
|       B|2021-06-26|         34|
+--------+----------+-----------+
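
Note that createDataFrame infers sales_date as a string column here, since the values were supplied as Python strings. The max() function still returns the latest date because dates in yyyy-MM-dd format sort the same way alphabetically as they do chronologically, but if you plan to perform date arithmetic it is safer to cast the column to a true date type first. Here is a minimal sketch using to_date():

from pyspark.sql import functions as F

#cast sales_date from string to a true date column
df = df.withColumn('sales_date', F.to_date('sales_date'))

#verify the new data type of each column
df.printSchema()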

Example 1: Find Max Date in One Column

We can use the following syntax to find the maximum date (i.e. the latest date) in the sales_date column:

from pyspark.sql import functions as F

#find max date in sales_date column
df.select(F.max('sales_date').alias('max_date')).show()

+----------+
|  max_date|
+----------+
|2022-12-21|
+----------+

We can see that the max date in the sales_date column is 2022-12-21.

Note: We used the alias function to rename the column to max_date in the resulting DataFrame.
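
Once you have the max date, you can pull it back into Python or use it in further date arithmetic, as mentioned in the introduction. The following sketch assumes the df from above (with sales_date cast to a date type); the alias week_before_max is just illustrative:

from pyspark.sql import functions as F

#extract the max date as a Python value using first()
max_date = df.select(F.max('sales_date').alias('max_date')).first()['max_date']
print(max_date)

#subtract 7 days from the max date
df.select(F.date_sub(F.max('sales_date'), 7).alias('week_before_max')).show()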

Example 2: Find Max Date in One Column, Grouped by Another Column

We can use the following syntax to find the max date in the sales_date column, grouped by the values in the employee column:

from pyspark.sql import functions as F

#find max date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.max('sales_date').alias('max_date')).show()

+--------+----------+
|employee|  max_date|
+--------+----------+
|       A|2020-10-25|
|       B|2022-12-21|
+--------+----------+

The resulting DataFrame shows the max sales date (i.e. latest date) for each unique employee in the DataFrame.

Note: You can also include multiple column names in the groupBy function to group by multiple columns if you’d like.
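
A related task is keeping only the row(s) that contain the max date for each group. One way to do this (a sketch, not part of the examples above) is with a window function that computes the max date per employee and then filters on it:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

#compute max date per employee with a window, then keep only matching rows
w = Window.partitionBy('employee')

df.withColumn('max_date', F.max('sales_date').over(w)) \
  .filter(F.col('sales_date') == F.col('max_date')) \
  .drop('max_date') \
  .show()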
