You can use the following methods to find the minimum date (i.e. the earliest date) in a column of a PySpark DataFrame:
Method 1: Find Minimum Date in One Column
from pyspark.sql import functions as F
#find minimum date in sales_date column
df.select(F.min('sales_date').alias('min_date')).show()
Method 2: Find Minimum Date in One Column, Grouped by Another Column
from pyspark.sql import functions as F
#find minimum date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.min('sales_date').alias('min_date')).show()
The following examples show how to use each method in practice with the following PySpark DataFrame that contains information about sales made by various employees at some company:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', '2020-10-25', 15],
['A', '2013-10-11', 24],
['A', '2015-10-17', 31],
['B', '2022-12-21', 27],
['B', '2021-04-14', 40],
['B', '2021-06-26', 34]]
#define column names
columns = ['employee', 'sales_date', 'total_sales']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+--------+----------+-----------+
|employee|sales_date|total_sales|
+--------+----------+-----------+
| A|2020-10-25| 15|
| A|2013-10-11| 24|
| A|2015-10-17| 31|
| B|2022-12-21| 27|
| B|2021-04-14| 40|
| B|2021-06-26| 34|
+--------+----------+-----------+
Example 1: Find Minimum Date in One Column
We can use the following syntax to find the minimum date (i.e. the earliest date) in the sales_date column:
from pyspark.sql import functions as F
#find minimum date in sales_date column
df.select(F.min('sales_date').alias('min_date')).show()
+----------+
| min_date|
+----------+
|2013-10-11|
+----------+
We can see that the minimum date in the sales_date column is 2013-10-11.
Note: We used the alias() function to rename the column to min_date in the resulting DataFrame.
Example 2: Find Minimum Date in One Column, Grouped by Another Column
We can use the following syntax to find the minimum date in the sales_date column, grouped by the values in the employee column:
from pyspark.sql import functions as F
#find minimum date in sales_date column, grouped by employee column
df.groupBy('employee').agg(F.min('sales_date').alias('min_date')).show()
+--------+----------+
|employee| min_date|
+--------+----------+
| A|2013-10-11|
| B|2021-04-14|
+--------+----------+
The resulting DataFrame shows the minimum sales date (i.e. earliest date) for each unique employee in the DataFrame.