How do I filter rows in PySpark using the NOT LIKE operator?

A NOT LIKE filter in PySpark removes the rows of a DataFrame whose values in a given column match a specified pattern. It excludes the rows that match and keeps all the others. The DataFrame API has no dedicated NOT LIKE method. Instead, you build the pattern match by calling the like function on a column and then negate it with the ~ operator. The result contains every row whose value does not match the pattern.

PySpark: Filter Rows Using NOT LIKE


You can use the following syntax to filter a PySpark DataFrame using a NOT LIKE operator:

df.filter(~df.team.like('%avs%')).show()

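This particular example filters the DataFrame to only show rows where the string in the team column does not contain “avs” anywhere in the string. The % symbols are wildcards that match any sequence of characters (including none), so '%avs%' matches “avs” at any position. The match is case-sensitive.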
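To make the wildcard behavior concrete, here is a minimal pure-Python sketch of how a LIKE pattern is interpreted. It is an illustration only, not Spark's implementation: like_to_regex and not_like are hypothetical helper names, and the sketch ignores escape characters.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a case-sensitive regex.

    Hypothetical helper for illustration only; escape characters are ignored.
    """
    parts = []
    for ch in pattern:
        if ch == '%':
            parts.append('.*')           # % matches any sequence of characters, including none
        elif ch == '_':
            parts.append('.')            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # every other character matches itself
    return re.compile(''.join(parts), re.DOTALL)

def not_like(value, pattern):
    """Return True when the whole string does NOT match the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is None

teams = ['Mavs', 'Nets', 'Lakers', 'Cavs', 'Wizards', 'Spurs', 'AVS']
kept = [t for t in teams if not_like(t, '%avs%')]
print(kept)  # ['Nets', 'Lakers', 'Wizards', 'Spurs', 'AVS']
```

Note that 'AVS' is kept because the comparison is case-sensitive, and that a pattern without a leading % (such as 'avs%') would only match strings that start with “avs”.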

The following example shows how to use this syntax in practice.

Example: How to Filter Using NOT LIKE in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Mavs', 15], 
        ['Cavs', 19],
        ['Wizards', 24],
        ['Cavs', 28],
        ['Nets', 40],
        ['Mavs', 24],
        ['Spurs', 13]] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|   Mavs|    15|
|   Cavs|    19|
|Wizards|    24|
|   Cavs|    28|
|   Nets|    40|
|   Mavs|    24|
|  Spurs|    13|
+-------+------+

We can use the following syntax to filter the DataFrame to only contain rows where the team column does not contain a pattern like “avs” somewhere in the string:

#filter DataFrame where team column does not contain pattern like 'avs'
df.filter(~df.team.like('%avs%')).show() 

+-------+------+
|   team|points|
+-------+------+
|   Nets|    33|
| Lakers|    12|
|Wizards|    24|
|   Nets|    40|
|  Spurs|    13|
+-------+------+

Notice that none of the rows in the resulting DataFrame contain “avs” in the team column.

Note that we used the like function to find all strings in the team column that contain “avs” and then used the ~ symbol to negate the result.

The end result is that we’re able to filter for only the rows in the DataFrame that do not have a pattern like “avs” in the team column.

Note: You can find the complete documentation for the PySpark like function in the official PySpark API reference, under Column.like.
