You can use the following syntax to filter for rows in a PySpark DataFrame that contain one of multiple values:
#define array of substrings to search for
my_values = ['ets', 'urs']

regex_values = "|".join(my_values)

#filter DataFrame where team column contains any substring from array
df.filter(df.team.rlike(regex_values)).show()
The following example shows how to use this syntax in practice.
Example: Filter for Rows that Contain One of Multiple Values in PySpark
Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['Mavs', 14],
['Nets', 22],
['Nets', 31],
['Cavs', 27],
['Kings', 26],
['Spurs', 40],
['Lakers', 23],
['Spurs', 17],]
#define column names
columns = ['team', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+------+------+
| team|points|
+------+------+
| Mavs| 14|
| Nets| 22|
| Nets| 31|
| Cavs| 27|
| Kings| 26|
| Spurs| 40|
|Lakers| 23|
| Spurs| 17|
+------+------+
We can use the following syntax to filter the DataFrame to only contain rows where the team column contains “ets” or “urs” somewhere in the string:
#define array of substrings to search for
my_values = ['ets', 'urs']

regex_values = "|".join(my_values)

#filter DataFrame where team column contains any substring from array
df.filter(df.team.rlike(regex_values)).show()

+-----+------+
| team|points|
+-----+------+
| Nets|    22|
| Nets|    31|
|Spurs|    40|
+-----+------+
Notice that each of the rows in the resulting DataFrame contains either “ets” or “urs” in the team column.
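If you instead want the rows whose team value does not contain any of the substrings, one way (shown here only as a sketch using the same DataFrame) is to negate the condition with the ~ operator:

#filter DataFrame where team column does NOT contain any substring from array
df.filter(~df.team.rlike(regex_values)).show()

Based on the data above, this filter should return the Mavs, Cavs, Kings, and Lakers rows.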
Note: We used the rlike function to search for partial string matches in the team column. You can find the complete documentation for the PySpark rlike function here.
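Also keep in mind that rlike interprets the pattern as a regular expression. If your substrings may contain regex metacharacters (such as periods or parentheses), a cautious approach, sketched below with hypothetical substrings, is to escape each value with Python's re.escape before joining them into the pattern:

import re

#hypothetical substrings that contain regex metacharacters
my_values = ['A.J.', 'Spurs (West)']

#escape metacharacters in each substring before building the pattern
regex_values = "|".join(re.escape(v) for v in my_values)

#filter DataFrame where team column contains any of the literal substrings
df.filter(df.team.rlike(regex_values)).show()

This way each value is matched literally rather than being treated as a regex pattern.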