How can I find the maximum date in PySpark, and what are some examples of how to do so?

In PySpark, the maximum date in a column can be found with the max() function from pyspark.sql.functions, applied to the column that holds the dates. One approach is to pass F.max() on the date column to select(), which returns a one-row DataFrame containing the latest date. Another is to pass F.max() to agg(), optionally after a groupBy(), to aggregate the maximum date per group. The examples below demonstrate both approaches.
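For instance, a minimal sketch of both approaches, assuming a DataFrame named df with a date column named sales_date (placeholder names used here for illustration):

from pyspark.sql import functions as F

#approach 1: select() with max()
df.select(F.max('sales_date').alias('max_date')).show()

#approach 2: agg() with max()
df.agg(F.max('sales_date').alias('max_date')).show()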

Find Max Date in PySpark (With Examples)


You can use the following methods to find the max date (i.e. the latest date) in a column of a PySpark DataFrame:

Method 1: Find Max Date in One Column

from pyspark.sql import functions as F

#find max date in sales_date column
df.select(F.max('sales_date').alias('max_date')).show()
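If you need the max date as a plain Python value rather than a one-row DataFrame, one option (a sketch using the same df and column names as above) is to collect the aggregated row:

from pyspark.sql import functions as F

#collect the single aggregated row and pull out the max date value
max_date = df.select(F.max('sales_date').alias('max_date')).first()['max_date']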

Method 2: Find Max Date in One Column, Grouped by Another Column

from pyspark.sql import functions as F

#find max date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.max('sales_date').alias('max_date')).show()
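If you would like the grouped results sorted so that the employee with the most recent sale appears first, you can append an orderBy (a minimal sketch using the same column names):

from pyspark.sql import functions as F

#find max date per employee and sort from latest to earliest
df.groupBy('employee').agg(F.max('sales_date').alias('max_date')).orderBy(F.desc('max_date')).show()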

The following examples show how to use each method in practice with the following PySpark DataFrame that contains information about sales made by various employees at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '2020-10-25', 15],
        ['A', '2013-10-11', 24],
        ['A', '2015-10-17', 31],
        ['B', '2022-12-21', 27],
        ['B', '2021-04-14', 40],
        ['B', '2021-06-26', 34]] 
  
#define column names
columns = ['employee', 'sales_date', 'total_sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+--------+----------+-----------+
|employee|sales_date|total_sales|
+--------+----------+-----------+
|       A|2020-10-25|         15|
|       A|2013-10-11|         24|
|       A|2015-10-17|         31|
|       B|2022-12-21|         27|
|       B|2021-04-14|         40|
|       B|2021-06-26|         34|
+--------+----------+-----------+
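Note that sales_date is stored as a string in this DataFrame. Because the dates use the ISO yyyy-MM-dd format, the string maximum matches the chronological maximum, so the results below are still correct. If you prefer a true date column, you could cast it first (a sketch using to_date; df_dates is a new DataFrame name used here only for illustration):

from pyspark.sql import functions as F

#cast the sales_date strings to DateType values
df_dates = df.withColumn('sales_date', F.to_date('sales_date', 'yyyy-MM-dd'))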

Example 1: Find Max Date in One Column

We can use the following syntax to find the maximum date (i.e. the latest date) in the sales_date column:

from pyspark.sql import functions as F

#find max date in sales_date column
df.select(F.max('sales_date').alias('max_date')).show()

+----------+
|  max_date|
+----------+
|2022-12-21|
+----------+

We can see that the max date in the sales_date column is 2022-12-21.

Note: We used the alias function to rename the column to max_date in the resulting DataFrame.
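Without alias, the resulting column gets Spark's default aggregate name, which is less convenient to reference later. For example:

from pyspark.sql import functions as F

#without alias, the resulting column is named max(sales_date)
df.select(F.max('sales_date')).show()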

Example 2: Find Max Date in One Column, Grouped by Another Column

We can use the following syntax to find the max date in the sales_date column, grouped by the values in the employee column:

from pyspark.sql import functions as F

#find max date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.max('sales_date').alias('max_date')).show()

+--------+----------+
|employee|  max_date|
+--------+----------+
|       A|2020-10-25|
|       B|2022-12-21|
+--------+----------+

The resulting DataFrame shows the max sales date (i.e. latest date) for each unique employee in the DataFrame.

Note: You can also include multiple column names in the groupBy function to group by multiple columns if you’d like.
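For instance, one way to group by employee and by the year of the sale at the same time (a sketch that derives a year column from sales_date) looks like this:

from pyspark.sql import functions as F

#find max date per employee per year of the sale
df.groupBy('employee', F.year(F.to_date('sales_date')).alias('sales_year')).agg(F.max('sales_date').alias('max_date')).show()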
