How can I use case-insensitive rlike in PySpark for a specific use case?

Case-insensitive rlike in PySpark lets you perform regex pattern matching without regard to letter case. This is useful when a column contains a mix of upper- and lowercase values and you want to match a pattern regardless of how it is capitalized, for example when filtering rows during data cleaning or analysis.

PySpark: Use Case-Insensitive rlike


You can use the rlike function in PySpark to search for regex matches in a string.

By default, the rlike function is case-sensitive but you can use the syntax (?i) to perform a case-insensitive search.

For example, you can use the following syntax to filter the rows in a DataFrame where the team column contains the string ‘avs’, regardless of case:

df.filter(df.team.rlike('(?i)avs')).show()

The following example shows how to use this syntax in practice.

Example: How to Use Case-Insensitive rlike in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['CAVS', 19],
        ['Wizards', 24],
        ['Cavs', 28],
        ['Jazz', 40],
        ['MAVS', 24],
        ['Lakers', 13]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|   CAVS|    19|
|Wizards|    24|
|   Cavs|    28|
|   Jazz|    40|
|   MAVS|    24|
| Lakers|    13|
+-------+------+

Suppose we use the rlike function in the following manner to filter for rows where the team column contains ‘avs’ somewhere in the string:

#filter for rows where team column contains 'avs'
df.filter(df.team.rlike('avs')).show()

+----+------+
|team|points|
+----+------+
|Mavs|    18|
|Cavs|    28|
+----+------+

Notice that the rows with ‘avs’ in the team column are returned, but the rows containing ‘AVS’ are not, because the match is case-sensitive by default.

To instead perform a case-insensitive search, we can include (?i) in the rlike function as follows:

#filter for rows where team column contains 'avs', regardless of case
df.filter(df.team.rlike('(?i)avs')).show()

+----+------+
|team|points|
+----+------+
|Mavs|    18|
|CAVS|    19|
|Cavs|    28|
|MAVS|    24|
+----+------+

Notice that all rows with ‘avs’ (regardless of case) in the team column are returned this time.

Note: You can find the complete documentation for the PySpark rlike function in the official PySpark documentation.
