How can I filter a PySpark dataframe using the “contains” function?

The “contains” function in PySpark filters a DataFrame by testing whether a string column contains a given substring. It is called on a single column and returns a boolean column, so passing it to filter keeps only the rows where the substring appears somewhere in that column's value. The match is literal: the argument is treated as plain text, not as a regular expression or wildcard pattern. This makes “contains” a simple way to pull the rows you care about out of a large dataset before further analysis.

PySpark: Filter Using “Contains”


You can use the following syntax to filter a PySpark DataFrame using a “contains” operator:

#filter DataFrame where team column contains 'avs'
df.filter(df.team.contains('avs')).show()

The following example shows how to use this syntax in practice.

Example: How to Filter Using “Contains” in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 14], 
        ['Nets', 22], 
        ['Nets', 31], 
        ['Cavs', 27], 
        ['Kings', 26], 
        ['Spurs', 40],
        ['Lakers', 23],
        ['Spurs', 17]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+------+------+
|  team|points|
+------+------+
|  Mavs|    14|
|  Nets|    22|
|  Nets|    31|
|  Cavs|    27|
| Kings|    26|
| Spurs|    40|
|Lakers|    23|
| Spurs|    17|
+------+------+

We can use the following syntax to filter the DataFrame to only contain rows where the team column contains “avs” somewhere in the string:

#filter DataFrame where team column contains 'avs'
df.filter(df.team.contains('avs')).show()

+----+------+
|team|points|
+----+------+
|Mavs|    14|
|Cavs|    27|
+----+------+

Notice that each row in the resulting DataFrame contains “avs” in the team column.

No other rows contained “avs” in the team column, which is why all other rows were filtered out of the DataFrame.

Note: The contains function is case-sensitive. For example, if you had used “AVS” instead, the filter would not have returned any rows because no team name contains “AVS” in all uppercase letters.
