How can I use “IS NOT IN” in PySpark, with an example?

“IS NOT IN” in PySpark is a logical operator used to filter out data that does not match a specified list of values. This operator is useful in data manipulation and analysis tasks, where certain data needs to be excluded from the results. For example, if we have a dataset of customer purchases and want to find all customers who have not bought a specific item, we can use “IS NOT IN” to filter out those customers who have purchased the item from the dataset. This allows for efficient and targeted data analysis.

Use “IS NOT IN” in PySpark (With Example)


You can use the following syntax in PySpark to filter DataFrame rows where a value in a particular column is not in a particular list:

#define array of values
my_array = ['A', 'D', 'E']

#filter DataFrame to only contain rows where 'team' is not in my_array
df.filter(~df.team.isin(my_array)).show()

This particular example will filter the DataFrame to only contain rows where the value in the team column is not equal to A, D, or E.

The following example shows how to use this syntax in practice.

Example: How to Use “IS NOT IN” in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2],
        ['D', 'East', 14, 2],
        ['E', 'West', 25, 2]]
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
|   D|      East|    14|      2|
|   E|      West|    25|      2|
+----+----------+------+-------+

We can use the following syntax to filter the DataFrame to only show rows where the value in the team column is not equal to A, D, or E:

#define array of values
my_array = ['A', 'D', 'E']

#filter DataFrame to only contain rows where 'team' is not in my_array
df.filter(~df.team.isin(my_array)).show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

The resulting DataFrame only contains rows where the value in the team column is not equal to A, D, or E.

Note: The tilde ( ~ ) operator is used in PySpark to represent NOT.

By using this operator along with the isin function, we are able to filter the DataFrame to only contain rows where the value in a particular column is not in a list of values.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

x