How can I filter by date range in PySpark using an example?

Filtering by date range in PySpark can be accomplished with the between method, which is defined on the Column class (pyspark.sql.Column.between) rather than in the pyspark.sql.functions module. The method takes a lower and upper bound and returns any records whose values fall within that range, inclusive of both endpoints. For example, if we have a dataset with a date column, we can use between to filter for records from January 1st, 2020 through December 31st, 2020. This returns every record from that year, allowing for easy analysis and manipulation of data within a specific time frame.


You can use the following syntax to filter rows in a PySpark DataFrame based on a date range:

#specify start and end dates
dates = ('2019-01-01', '2022-01-01')

#filter DataFrame to only show rows between start and end dates
df.filter(df.start_date.between(*dates)).show()

This particular example filters the DataFrame to only contain rows where the date in the start_date column is between 2019-01-01 and 2022-01-01.
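As an aside, between(lower, upper) is simply shorthand for a pair of comparisons, so an equivalent filter can be written with the >= and <= operators joined by &. A minimal sketch using the same column and dates as above:

#equivalent filter using explicit comparison operators
df.filter((df.start_date >= '2019-01-01') & (df.start_date <= '2022-01-01')).show()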

The following example shows how to use this syntax in practice.

Example: How to Filter by Date Range in PySpark

Suppose we have the following PySpark DataFrame that contains information about the start date for various employees at a company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '2017-10-25'],
        ['B', '2018-10-11'],
        ['C', '2018-10-17'],
        ['D', '2019-12-21'],
        ['E', '2021-04-14'],
        ['F', '2022-06-26']] 
  
#define column names
columns = ['employee', 'start_date']
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+--------+----------+
|employee|start_date|
+--------+----------+
|       A|2017-10-25|
|       B|2018-10-11|
|       C|2018-10-17|
|       D|2019-12-21|
|       E|2021-04-14|
|       F|2022-06-26|
+--------+----------+
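One detail worth noting before filtering: createDataFrame infers start_date as a string column here, and between then compares the values lexicographically, which happens to give the right answer for zero-padded YYYY-MM-DD strings. If you would rather compare true dates, one option is to cast the column with the to_date function from pyspark.sql.functions. A minimal sketch, using df_dates as an illustrative name for the converted DataFrame:

from pyspark.sql.functions import to_date

#create a copy of the DataFrame with start_date cast to DateType
df_dates = df.withColumn('start_date', to_date(df.start_date))

#the between filter works the same way on the date column
df_dates.filter(df_dates.start_date.between('2019-01-01', '2022-01-01')).show()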

We can use the following syntax to filter the DataFrame to only contain rows where the date in the start_date column is between 2019-01-01 and 2022-01-01:

#specify start and end dates
dates = ('2019-01-01', '2022-01-01')

#filter DataFrame to only show rows between start and end dates
df.filter(df.start_date.between(*dates)).show()

+--------+----------+
|employee|start_date|
+--------+----------+
|       D|2019-12-21|
|       E|2021-04-14|
+--------+----------+

Notice that the DataFrame has been filtered to only show the two rows whose start_date values fall between 2019-01-01 and 2022-01-01.
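If you are more comfortable with SQL syntax, the same filter can be written with a BETWEEN clause after registering the DataFrame as a temporary view. The following is an equivalent sketch; the view name employees is just an illustrative choice:

#register the DataFrame as a temporary view (the name is arbitrary)
df.createOrReplaceTempView('employees')

#equivalent filter expressed in Spark SQL
spark.sql("SELECT * FROM employees WHERE start_date BETWEEN '2019-01-01' AND '2022-01-01'").show()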

Note that if you only want to know how many rows have a date within a specific date range, you can chain the count method instead of show as follows:

#specify start and end dates
dates = ('2019-01-01', '2022-01-01')

#count number of rows in DataFrame that fall between start and end dates
df.filter(df.start_date.between(*dates)).count()

2

This tells us that there are two rows in the DataFrame where the date in the start_date column falls between 2019-01-01 and 2022-01-01.
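Relatedly, if you want the rows that fall outside the date range instead, you can negate the between condition with the ~ operator. Here is a quick sketch using the same dates tuple:

#count number of rows whose start_date falls outside the range
df.filter(~df.start_date.between(*dates)).count()

4

For the example DataFrame above, this returns 4, since employees A, B, C, and F all started outside of the range.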

Note: You can find the complete documentation for the PySpark between method in the official PySpark API reference.
