How can I use PySpark to filter for values that do not contain a specific substring or pattern?

PySpark lets you filter a DataFrame so that it keeps only the rows whose values do not contain a specific substring or pattern. There is no dedicated “not contains” function. Instead, you build a string-matching condition with contains, like, or rlike and negate it with the ~ operator. This is useful for data cleaning because it excludes unwanted records in a single step, without having to find and remove each record by hand.

PySpark: Filter for “Not Contains”


You can use the following syntax to filter a PySpark DataFrame by using a “Not Contains” operator:

#filter DataFrame where team does not contain 'avs'
df.filter(~df.team.contains('avs')).show()

The following example shows how to use this syntax in practice.

Example: How to Filter for “Not Contains” in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 14], 
        ['Nets', 22], 
        ['Nets', 31], 
        ['Cavs', 27], 
        ['Kings', 26], 
        ['Spurs', 40],
        ['Lakers', 23],
        ['Spurs', 17],] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+------+------+
|  team|points|
+------+------+
|  Mavs|    14|
|  Nets|    22|
|  Nets|    31|
|  Cavs|    27|
| Kings|    26|
| Spurs|    40|
|Lakers|    23|
| Spurs|    17|
+------+------+

We can use the following syntax to filter the DataFrame to only contain rows where the team column does not contain “avs” anywhere in the string:

#filter DataFrame where team does not contain 'avs'
df.filter(~df.team.contains('avs')).show()

+------+------+
|  team|points|
+------+------+
|  Nets|    22|
|  Nets|    31|
| Kings|    26|
| Spurs|    40|
|Lakers|    23|
| Spurs|    17|
+------+------+

Notice that none of the rows in the resulting DataFrame contain “avs” in the team column.

The rows for the Mavs and Cavs have both been filtered out because both team names contain “avs”.

Note: The contains function is case-sensitive. For example, if you had used “AVS” instead, the Mavs and Cavs rows would not have been filtered out of the DataFrame.
