How can I use case-insensitive “contains” in PySpark for my specific use case?

To perform a case-insensitive “contains” search in PySpark, follow these steps (a minimal sketch appears after this list):

1. Identify the column to be searched.
2. Use the “lower” function (or, equivalently, “upper”) to convert the column values to a single case.
3. Apply the “contains” function to the converted column, with the search term converted to the same case.
4. The comparison then ignores case, returning a Boolean value for each row that indicates whether the search term is present.
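For example, here is a minimal sketch of those steps using the lower function (the DataFrame and its name column here are hypothetical, invented purely for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lower

spark = SparkSession.builder.getOrCreate()

#hypothetical DataFrame with a 'name' column
df = spark.createDataFrame([('Alice',), ('ALVIN',), ('Bob',)], ['name'])

#convert the column to lowercase, then search for the lowercase term 'al'
df.filter(lower(df.name).contains('al')).show()

This matches both 'Alice' and 'ALVIN', since both contain 'al' once case is ignored.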

PySpark: Use Case-Insensitive “Contains”

By default, the contains function in PySpark is case-sensitive.

However, you can use the following syntax to perform a case-insensitive “contains” filter, which matches rows that contain a specific string regardless of case:

from pyspark.sql.functions import upper

#perform case-insensitive filter for rows that contain 'AVS' in team column
df.filter(upper(df.team).contains('AVS')).show()

The following example shows how to use this syntax in practice.

Example: How to Use Case-Insensitive “Contains” in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 14], 
        ['Nets', 22], 
        ['Nets', 31], 
        ['Cavs', 27], 
        ['CAVS', 26], 
        ['Spurs', 40],
        ['mavs', 23],
        ['MAVS', 17]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+------+
| team|points|
+-----+------+
| Mavs|    14|
| Nets|    22|
| Nets|    31|
| Cavs|    27|
| CAVS|    26|
|Spurs|    40|
| mavs|    23|
| MAVS|    17|
+-----+------+

Suppose we use the following syntax to filter the DataFrame to only contain rows where the team column contains “AVS” somewhere in the string:

#filter DataFrame where team column contains 'AVS'
df.filter(df.team.contains('AVS')).show()

+----+------+
|team|points|
+----+------+
|CAVS|    26|
|MAVS|    17|
+----+------+

Notice that this syntax performs a case-sensitive search by default and only returns the rows where the team column contains “AVS” in all uppercase.

However, suppose we would like to perform a case-insensitive search and return all rows where the team column contains “AVS”, regardless of case.

We can use the following syntax to do so:

from pyspark.sql.functions import upper

#perform case-insensitive filter for rows that contain 'AVS' in team column
df.filter(upper(df.team).contains('AVS')).show()

+----+------+
|team|points|
+----+------+
|Mavs|    14|
|Cavs|    27|
|CAVS|    26|
|mavs|    23|
|MAVS|    17|
+----+------+

Notice that this syntax performs a case-insensitive search and returns all rows where the team column contains “AVS”, regardless of case.

Note: We used the upper function to first convert all strings in the team column to uppercase and then searched for “AVS”, which is equivalent to performing a case-insensitive “contains” filter.
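Equivalently, you could convert to lowercase instead, as long as the search term is also lowercase. Assuming the same df as above, this sketch returns the same five rows:

from pyspark.sql.functions import lower

#perform the same case-insensitive filter using lower instead of upper
df.filter(lower(df.team).contains('avs')).show()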
