How can I filter rows in PySpark using the LIKE operator?

The LIKE operator in PySpark filters the rows of a DataFrame by matching a string column against a pattern. Because the pattern can include wildcards, it supports partial matching, which makes it useful for data cleaning and analysis tasks such as finding all rows that contain a certain word or phrase. You can apply it directly to a column with the like() method, which is the approach this tutorial uses, or inside a SQL query after registering the DataFrame as a temporary view with the .createOrReplaceTempView() method.

PySpark: Filter Rows Using LIKE Operator


You can use the following syntax to filter a PySpark DataFrame using a LIKE operator:

df.filter(df.team.like('%avs%')).show()

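This particular example filters the DataFrame to only show rows where the string in the team column contains “avs” somewhere in the string. The % wildcard matches any sequence of zero or more characters, so the pattern '%avs%' matches “avs” at the start, in the middle, or at the end of the value.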
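To make the wildcard rules concrete, here is a rough pure-Python sketch of how a LIKE pattern is matched: % stands for any sequence of characters and _ stands for exactly one character. The helper names like_to_regex and like are hypothetical, and the sketch ignores escape characters, so treat it as an illustration of the semantics rather than as Spark's actual implementation.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches zero or more characters
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    return re.compile("".join(parts), re.DOTALL)

def like(value, pattern):
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None

teams = ["Mavs", "Nets", "Lakers", "Cavs", "Wizards", "Spurs"]
matches = [t for t in teams if like(t, "%avs%")]
print(matches)  # ['Mavs', 'Cavs']
```

Note that the pattern must match the entire value, which is why 'avs' on its own would not match “Mavs” but '%avs%' or '_avs' would.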

The following example shows how to use this syntax in practice.

Example: How to Filter Using LIKE Operator in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Mavs', 15], 
        ['Cavs', 19],
        ['Wizards', 24],
        ['Cavs', 28],
        ['Nets', 40],
        ['Mavs', 24],
        ['Spurs', 13]] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|   Mavs|    15|
|   Cavs|    19|
|Wizards|    24|
|   Cavs|    28|
|   Nets|    40|
|   Mavs|    24|
|  Spurs|    13|
+-------+------+

We can use the following syntax to filter the DataFrame to only contain rows where the team column contains the pattern “avs” somewhere in the string:

#filter DataFrame where team column contains pattern like 'avs'
df.filter(df.team.like('%avs%')).show() 

+----+------+
|team|points|
+----+------+
|Mavs|    18|
|Mavs|    15|
|Cavs|    19|
|Cavs|    28|
|Mavs|    24|
+----+------+

Notice that each row in the resulting DataFrame contains “avs” in the team column.

Note: You can find the complete documentation for the PySpark like function in the official PySpark API reference for Column.like.
