How can I use PySpark to filter for rows that contain one of multiple values?

PySpark is a powerful tool for data analysis and manipulation in Python. One useful feature of PySpark is the ability to filter data based on specific criteria using the “filter” function, which selects the rows of a DataFrame that satisfy a given condition. To keep rows whose column value exactly matches one of multiple values, you can combine “filter” with the “isin” function; to keep rows that merely contain one of multiple substrings, you can combine “filter” with the “rlike” function instead. Either way, PySpark filters for rows matching any of the specified values efficiently, making it a practical tool for data analysis and processing.
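
For example, if you want to keep only rows where a column exactly equals one of several values, “isin” is the simpler choice. Here is a minimal sketch, assuming a DataFrame df with a team column like the one created later in this tutorial:

#filter for rows where team is exactly 'Nets' or 'Spurs'
df.filter(df.team.isin(['Nets', 'Spurs'])).show()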

PySpark: Filter for Rows that Contain One of Multiple Values


You can use the following syntax to filter for rows in a PySpark DataFrame that contain one of multiple values:

#define array of substrings to search for
my_values = ['ets', 'urs']
regex_values = "|".join(my_values)

#filter DataFrame where team column contains any substring from array
df.filter(df.team.rlike(regex_values)).show()

The following example shows how to use this syntax in practice.

Example: Filter for Rows that Contain One of Multiple Values in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession

#start a Spark session
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 14], 
        ['Nets', 22], 
        ['Nets', 31], 
        ['Cavs', 27], 
        ['Kings', 26], 
        ['Spurs', 40],
        ['Lakers', 23],
        ['Spurs', 17]] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+------+------+
|  team|points|
+------+------+
|  Mavs|    14|
|  Nets|    22|
|  Nets|    31|
|  Cavs|    27|
| Kings|    26|
| Spurs|    40|
|Lakers|    23|
| Spurs|    17|
+------+------+

We can use the following syntax to filter the DataFrame to only contain rows where the team column contains “ets” or “urs” somewhere in the string:

#define array of substrings to search for
my_values = ['ets', 'urs']
regex_values = "|".join(my_values)

#filter DataFrame where team column contains any substring from array
df.filter(df.team.rlike(regex_values)).show()

+-----+------+
| team|points|
+-----+------+
| Nets|    22|
| Nets|    31|
|Spurs|    40|
+-----+------+

Notice that each of the rows in the resulting DataFrame contains either “ets” or “urs” in the team column.
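
Also note that rlike performs case-sensitive matching by default. If you need a case-insensitive search, one option is to prepend the (?i) inline flag, which the Java regular expression engine used by Spark supports (a minimal sketch, not part of the original example):

#filter case-insensitively so 'NETS' or 'SPURS' would also match
df.filter(df.team.rlike("(?i)" + regex_values)).show()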

Note: We used the rlike function to search for partial string matches in the team column. You can find the complete documentation for the PySpark rlike function in the official PySpark docs.
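
One caveat worth keeping in mind: because rlike interprets the joined string as a regular expression, any substring containing regex metacharacters (such as . or +) should be escaped before joining. The sketch below uses Python's re.escape, assuming its escaping is compatible with Spark's regex engine for these common characters; the value 'A.C.' is a hypothetical substring, not one from the data above:

import re

#hypothetical substrings, one of which contains a regex metacharacter
my_values = ['A.C.', 'urs']

#escape regex metacharacters in each substring before joining
regex_values = "|".join(re.escape(v) for v in my_values)

df.filter(df.team.rlike(regex_values)).show()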
