PySpark: Filter Using “Contains”


You can use the following syntax to filter a PySpark DataFrame using a “contains” operator:

#filter DataFrame where team column contains 'avs'
df.filter(df.team.contains('avs')).show()

The following example shows how to use this syntax in practice.

Example: How to Filter Using “Contains” in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 14], 
        ['Nets', 22], 
        ['Nets', 31], 
        ['Cavs', 27], 
        ['Kings', 26], 
        ['Spurs', 40],
        ['Lakers', 23],
        ['Spurs', 17]] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+------+------+
|  team|points|
+------+------+
|  Mavs|    14|
|  Nets|    22|
|  Nets|    31|
|  Cavs|    27|
| Kings|    26|
| Spurs|    40|
|Lakers|    23|
| Spurs|    17|
+------+------+

We can use the following syntax to filter the DataFrame to only contain rows where the team column contains “avs” somewhere in the string:

#filter DataFrame where team column contains 'avs'
df.filter(df.team.contains('avs')).show()

+----+------+
|team|points|
+----+------+
|Mavs|    14|
|Cavs|    27|
+----+------+

Notice that each of the rows in the resulting DataFrame contains “avs” in the team column.

No other rows contained “avs” in the team column, which is why all other rows were filtered out of the DataFrame.

Note: The contains function is case-sensitive. For example, if you had used “AVS” then the filter would not have returned any rows because no team name contains “AVS” in all uppercase letters.
