How to Find the Minimum Date in PySpark


You can use the following methods to find the minimum date (i.e. the earliest date) in a column of a PySpark DataFrame:

Method 1: Find Minimum Date in One Column

from pyspark.sql import functions as F

#find minimum date in sales_date column
df.select(F.min('sales_date').alias('min_date')).show()

Method 2: Find Minimum Date in One Column, Grouped by Another Column

from pyspark.sql import functions as F

#find minimum date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.min('sales_date').alias('min_date')).show()

The following examples show how to use each method in practice with the following PySpark DataFrame that contains information about sales made by various employees at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '2020-10-25', 15],
        ['A', '2013-10-11', 24],
        ['A', '2015-10-17', 31],
        ['B', '2022-12-21', 27],
        ['B', '2021-04-14', 40],
        ['B', '2021-06-26', 34]] 
  
#define column names
columns = ['employee', 'sales_date', 'total_sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+--------+----------+-----------+
|employee|sales_date|total_sales|
+--------+----------+-----------+
|       A|2020-10-25|         15|
|       A|2013-10-11|         24|
|       A|2015-10-17|         31|
|       B|2022-12-21|         27|
|       B|2021-04-14|         40|
|       B|2021-06-26|         34|
+--------+----------+-----------+

Example 1: Find Minimum Date in One Column

We can use the following syntax to find the minimum date (i.e. the earliest date) in the sales_date column:

from pyspark.sql import functions as F

#find minimum date in sales_date column
df.select(F.min('sales_date').alias('min_date')).show()

+----------+
|  min_date|
+----------+
|2013-10-11|
+----------+

We can see that the minimum date in the sales_date column is 2013-10-11.

Note: We used the alias function to rename the column to min_date in the resulting DataFrame.

Example 2: Find Minimum Date in One Column, Grouped by Another Column

We can use the following syntax to find the minimum date in the sales_date column, grouped by the values in the employee column:

from pyspark.sql import functions as F

#find minimum date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.min('sales_date').alias('min_date')).show()

+--------+----------+
|employee|  min_date|
+--------+----------+
|       A|2013-10-11|
|       B|2021-04-14|
+--------+----------+

The resulting DataFrame shows the minimum sales date (i.e. earliest date) for each unique employee in the DataFrame.
