To filter a PySpark DataFrame by a specific date range, you can use the filter function together with the between condition on a date column. For example, if we have a PySpark DataFrame named df with a column date containing dates in the format YYYY-MM-DD, we can filter it to a specific date range as follows:

df.filter(df.date.between('2021-01-01', '2021-03-31'))

This returns a new DataFrame containing only the rows where the date column falls between January 1, 2021 and March 31, 2021. This allows for filtering by a specific date range in a PySpark DataFrame.
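Under the hood, between(a, b) is shorthand for combining two comparisons. As a minimal sketch (assuming the same df with a date column), the following filter produces the same result using explicit operators:

from pyspark.sql import functions as F

#equivalent filter using explicit comparison operators
#between(a, b) is shorthand for (col >= a) & (col <= b)
df.filter((F.col('date') >= '2021-01-01') & (F.col('date') <= '2021-03-31'))

The between form is usually preferred for readability, but the explicit version is handy when you need a half-open range (for example, >= on one end and < on the other).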
Filter by Date Range in PySpark (With Example)
You can use the following syntax to filter rows in a PySpark DataFrame based on a date range:
#specify start and end dates
dates = ('2019-01-01', '2022-01-01')

#filter DataFrame to only show rows between start and end dates
df.filter(df.start_date.between(*dates)).show()
This particular example filters the DataFrame to only contain rows where the date in the start_date column is between 2019-01-01 and 2022-01-01. Note that the between condition is inclusive of both endpoints.
The following example shows how to use this syntax in practice.
Example: How to Filter by Date Range in PySpark
Suppose we have the following PySpark DataFrame that contains information about the start date for various employees at a company:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '2017-10-25'],
        ['B', '2018-10-11'],
        ['C', '2018-10-17'],
        ['D', '2019-12-21'],
        ['E', '2021-04-14'],
        ['F', '2022-06-26']]

#define column names
columns = ['employee', 'start_date']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+--------+----------+
|employee|start_date|
+--------+----------+
|       A|2017-10-25|
|       B|2018-10-11|
|       C|2018-10-17|
|       D|2019-12-21|
|       E|2021-04-14|
|       F|2022-06-26|
+--------+----------+
We can use the following syntax to filter the DataFrame to only contain rows where the date in the start_date column of the DataFrame is between 2019-01-01 and 2022-01-01:
#specify start and end dates
dates = ('2019-01-01', '2022-01-01')

#filter DataFrame to only show rows between start and end dates
df.filter(df.start_date.between(*dates)).show()

+--------+----------+
|employee|start_date|
+--------+----------+
|       D|2019-12-21|
|       E|2021-04-14|
+--------+----------+
Notice that the DataFrame has been filtered to only show the two rows where the date in the start_date column falls between 2019-01-01 and 2022-01-01.
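Since between is inclusive of both boundary dates, rows whose start_date exactly matches either endpoint are also kept. As a quick sketch against the same DataFrame, using boundary values that match existing rows:

#filter using boundary dates that exactly match rows B and D
df.filter(df.start_date.between('2018-10-11', '2019-12-21')).show()

#rows B, C and D are returned, because the boundary dates themselves are included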
Note that if you only want to know how many rows have a date within a specific date range, then you can use the count function as follows:
#specify start and end dates
dates = ('2019-01-01', '2022-01-01')

#count number of rows in DataFrame that fall between start and end dates
df.filter(df.start_date.between(*dates)).count()

2
This tells us that there are two rows in the DataFrame where the date in the start_date column falls between 2019-01-01 and 2022-01-01.
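One caveat: in this example the start_date column holds strings, so between performs a lexicographic string comparison, which works here only because the dates use the YYYY-MM-DD format. For other date formats, or to be explicit about date semantics, you can cast the column to a true date type first. A minimal sketch, assuming the same df:

from pyspark.sql import functions as F

#convert the string column to a DateType column, then filter
df_dates = df.withColumn('start_date', F.to_date('start_date', 'yyyy-MM-dd'))
df_dates.filter(F.col('start_date').between('2019-01-01', '2022-01-01')).show()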
Note: You can find the complete documentation for the PySpark between function in the official PySpark API reference.