In PySpark, the max date in a column can be found by applying the max() function to that date column. The function returns the latest date in the column, and the result can then be used in further date operations, such as adding or subtracting days from the max date.
You can use the following methods to find the max date (i.e. the latest date) in a column of a PySpark DataFrame:
Method 1: Find Max Date in One Column
from pyspark.sql import functions as F
#find max date in sales_date column
df.select(F.max('sales_date').alias('max_date')).show()
Method 2: Find Max Date in One Column, Grouped by Another Column
from pyspark.sql import functions as F
#find max date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.max('sales_date').alias('max_date')).show()
The following examples show how to use each method in practice with the following PySpark DataFrame that contains information about sales made by various employees at some company:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '2020-10-25', 15],
        ['A', '2013-10-11', 24],
        ['A', '2015-10-17', 31],
        ['B', '2022-12-21', 27],
        ['B', '2021-04-14', 40],
        ['B', '2021-06-26', 34]]

#define column names
columns = ['employee', 'sales_date', 'total_sales']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+--------+----------+-----------+
|employee|sales_date|total_sales|
+--------+----------+-----------+
|       A|2020-10-25|         15|
|       A|2013-10-11|         24|
|       A|2015-10-17|         31|
|       B|2022-12-21|         27|
|       B|2021-04-14|         40|
|       B|2021-06-26|         34|
+--------+----------+-----------+
Example 1: Find Max Date in One Column
We can use the following syntax to find the maximum date (i.e. the latest date) in the sales_date column:
from pyspark.sql import functions as F
#find max date in sales_date column
df.select(F.max('sales_date').alias('max_date')).show()
+----------+
| max_date|
+----------+
|2022-12-21|
+----------+
We can see that the max date in the sales_date column is 2022-12-21.
Note: We used the alias function to rename the column to max_date in the resulting DataFrame.
Example 2: Find Max Date in One Column, Grouped by Another Column
We can use the following syntax to find the max date in the sales_date column, grouped by the values in the employee column:
from pyspark.sql import functions as F
#find max date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.max('sales_date').alias('max_date')).show()
+--------+----------+
|employee| max_date|
+--------+----------+
| A|2020-10-25|
| B|2022-12-21|
+--------+----------+
The resulting DataFrame shows the max sales date (i.e. latest date) for each unique employee in the DataFrame.
Note: You can also include multiple column names in the groupBy function to group by multiple columns if you’d like.