PySpark: Filter for “Not Contains”


You can use the following syntax to filter a PySpark DataFrame by using a “Not Contains” operator:

#filter DataFrame where team does not contain 'avs'
df.filter(~df.team.contains('avs')).show()

The following example shows how to use this syntax in practice.

Example: How to Filter for “Not Contains” in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 14], 
        ['Nets', 22], 
        ['Nets', 31], 
        ['Cavs', 27], 
        ['Kings', 26], 
        ['Spurs', 40],
        ['Lakers', 23],
        ['Spurs', 17]] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+------+------+
|  team|points|
+------+------+
|  Mavs|    14|
|  Nets|    22|
|  Nets|    31|
|  Cavs|    27|
| Kings|    26|
| Spurs|    40|
|Lakers|    23|
| Spurs|    17|
+------+------+

We can use the following syntax to filter the DataFrame to only contain rows where the team column does not contain “avs” anywhere in the string:

#filter DataFrame where team does not contain 'avs'
df.filter(~df.team.contains('avs')).show()

+------+------+
|  team|points|
+------+------+
|  Nets|    22|
|  Nets|    31|
| Kings|    26|
| Spurs|    40|
|Lakers|    23|
| Spurs|    17|
+------+------+

Notice that none of the rows in the resulting DataFrame contain “avs” in the team column.

Note that the rows that contained Mavs and Cavs in the team column have both been filtered out, since both of these team names contain “avs”.

Note: The contains function is case-sensitive. For example, if you had used “AVS” instead, the filter would not have removed Mavs and Cavs from the DataFrame, since neither team name contains the uppercase string “AVS”.
