You can use the following syntax to filter for rows in a PySpark DataFrame that contain one of multiple values:
#define array of substrings to search for
my_values = ['ets', 'urs']

regex_values = "|".join(my_values)

#filter DataFrame where team column contains any substring from array
df.filter(df.team.rlike(regex_values)).show()
The following example shows how to use this syntax in practice.
Example: Filter for Rows that Contain One of Multiple Values in PySpark
Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['Mavs', 14],
['Nets', 22],
['Nets', 31],
['Cavs', 27],
['Kings', 26],
['Spurs', 40],
['Lakers', 23],
['Spurs', 17],]
#define column names
columns = ['team', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+------+------+
| team|points|
+------+------+
| Mavs| 14|
| Nets| 22|
| Nets| 31|
| Cavs| 27|
| Kings| 26|
| Spurs| 40|
|Lakers| 23|
| Spurs| 17|
+------+------+
We can use the following syntax to filter the DataFrame to only contain rows where the team column contains “ets” or “urs” somewhere in the string:
#define array of substrings to search for
my_values = ['ets', 'urs']

regex_values = "|".join(my_values)

#filter DataFrame where team column contains any substring from array
df.filter(df.team.rlike(regex_values)).show()

+-----+------+
| team|points|
+-----+------+
| Nets|    22|
| Nets|    31|
|Spurs|    40|
+-----+------+
Notice that each of the rows in the resulting DataFrame contains either “ets” or “urs” in the team column.
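If you instead want the rows whose team value does not contain any of the substrings, one way (shown here only as a sketch using the same DataFrame) is to negate the condition with the ~ operator:

#filter DataFrame where team column does NOT contain any substring from array
df.filter(~df.team.rlike(regex_values)).show()

Based on the data above, this filter should return the Mavs, Cavs, Kings, and Lakers rows.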
Note: We used the rlike function to search for partial string matches in the team column. You can find the complete documentation for the PySpark rlike function here.
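Also keep in mind that rlike interprets the pattern as a regular expression. If your substrings may contain regex metacharacters (such as periods or parentheses), a cautious approach, sketched below with hypothetical substrings, is to escape each value with Python's re.escape before joining them into the pattern:

import re

#hypothetical substrings that contain regex metacharacters
my_values = ['A.J.', 'Spurs (West)']

#escape metacharacters in each substring before building the pattern
regex_values = "|".join(re.escape(v) for v in my_values)

#filter DataFrame where team column contains any of the literal substrings
df.filter(df.team.rlike(regex_values)).show()

This way each value is matched literally rather than being treated as a regex pattern.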