PySpark: Use Case-Insensitive “Contains”


By default, the contains function in PySpark is case-sensitive.

However, you can use the following syntax to use a case-insensitive “contains” to filter a DataFrame where rows contain a specific string, regardless of case:

from pyspark.sql.functions import upper

#perform case-insensitive filter for rows that contain 'AVS' in team column
df.filter(upper(df.team).contains('AVS')).show()

The following example shows how to use this syntax in practice.

Example: How to Use Case-Insensitive “Contains” in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 14], 
        ['Nets', 22], 
        ['Nets', 31], 
        ['Cavs', 27], 
        ['CAVS', 26], 
        ['Spurs', 40],
        ['mavs', 23],
        ['MAVS', 17],] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+------+
| team|points|
+-----+------+
| Mavs|    14|
| Nets|    22|
| Nets|    31|
| Cavs|    27|
| CAVS|    26|
|Spurs|    40|
| mavs|    23|
| MAVS|    17|
+-----+------+

Suppose we use the following syntax to filter the DataFrame to only contain rows where the team column contains “AVS” somewhere in the string:

#filter DataFrame where team column contains 'AVS'
df.filter(df.team.contains('AVS')).show()

+----+------+
|team|points|
+----+------+
|CAVS|    26|
|MAVS|    17|
+----+------+

Notice that this syntax performs a case-sensitive search by default and only returns the rows where the team column contains “AVS” in all uppercase.

However, suppose we would like to perform a case-insensitive search and return all rows where the team column contains “AVS”, regardless of case.

We can use the following syntax to do so:

from pyspark.sql.functions import upper

#perform case-insensitive filter for rows that contain 'AVS' in team column
df.filter(upper(df.team).contains('AVS')).show()

+----+------+
|team|points|
+----+------+
|Mavs|    14|
|Cavs|    27|
|CAVS|    26|
|mavs|    23|
|MAVS|    17|
+----+------+

Notice that this syntax performs a case-insensitive search and returns all rows where the team column contains “AVS”, regardless of case.

Note: We used the upper function to first convert all strings in the team column to uppercase and then searched for “AVS”, which is the equivalent of using a case-sensitive “contains” filter.

x