What is the Use of the Case-Insensitive rlike Function in PySpark?


You can use the rlike function in PySpark to search for regex matches in a string.

By default, the rlike function is case-sensitive, but you can add the flag (?i) to the start of the regex pattern to perform a case-insensitive search.

For example, you can use the following syntax to filter the rows in a DataFrame where the team column contains the string ‘avs’, regardless of case:

df.filter(df.team.rlike('(?i)avs')).show()

The following example shows how to use this syntax in practice.

Example: How to Use Case-Insensitive rlike in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by players on various basketball teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['CAVS', 19],
        ['Wizards', 24],
        ['Cavs', 28],
        ['Jazz', 40],
        ['MAVS', 24],
        ['Lakers', 13]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|   CAVS|    19|
|Wizards|    24|
|   Cavs|    28|
|   Jazz|    40|
|   MAVS|    24|
| Lakers|    13|
+-------+------+

Suppose we use the rlike function in the following manner to filter for rows where the team column contains ‘avs’ somewhere in the string:

#filter for rows where team column contains 'avs'
df.filter(df.team.rlike('avs')).show()

+----+------+
|team|points|
+----+------+
|Mavs|    18|
|Cavs|    28|
+----+------+

Notice that each row with ‘avs’ in the team column is returned, but the rows with ‘AVS’ are not, because rlike is case-sensitive by default.

To instead perform a case-insensitive search, we can include (?i) in the rlike function as follows:

#filter for rows where team column contains 'avs', regardless of case
df.filter(df.team.rlike('(?i)avs')).show()

+----+------+
|team|points|
+----+------+
|Mavs|    18|
|CAVS|    19|
|Cavs|    28|
|MAVS|    24|
+----+------+

Notice that all rows with ‘avs’ (regardless of case) in the team column are returned this time.
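If you want to sanity-check a pattern before running it on a DataFrame, one option is to try it with Python's built-in re module. The sketch below is only an approximation: rlike uses Java regular expressions rather than Python's engine, but the two should behave the same for a simple pattern like (?i)avs. The list name teams and the two result variables are just illustrative names for this example.

```python
import re

# Same team names as in the DataFrame above (plain Python list, no Spark needed)
teams = ['Mavs', 'Nets', 'Lakers', 'Kings', 'CAVS',
         'Wizards', 'Cavs', 'Jazz', 'MAVS', 'Lakers']

# Case-sensitive search: only names containing lowercase 'avs' match
case_sensitive = [t for t in teams if re.search('avs', t)]

# Case-insensitive search using the inline (?i) flag at the start of the pattern
case_insensitive = [t for t in teams if re.search('(?i)avs', t)]

print(case_sensitive)    # ['Mavs', 'Cavs']
print(case_insensitive)  # ['Mavs', 'CAVS', 'Cavs', 'MAVS']
```

The two lists line up with the two filtered DataFrames shown above, which suggests the pattern does what we expect before it is used in rlike.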

Note: You can find the complete documentation for the PySpark rlike function in the official PySpark documentation.
