Use “IS NOT IN” in PySpark (With Example)


You can use the following syntax in PySpark to filter DataFrame rows where a value in a particular column is not in a particular list:

#define array of values
my_array = ['A', 'D', 'E']

#filter DataFrame to only contain rows where 'team' is not in my_array
df.filter(~df.team.isin(my_array)).show()

This particular example will filter the DataFrame to only contain rows where the value in the team column is not equal to A, D, or E.

The following example shows how to use this syntax in practice.

Example: How to Use “IS NOT IN” in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2],
        ['D', 'East', 14, 2],
        ['E', 'West', 25, 2]]
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
|   D|      East|    14|      2|
|   E|      West|    25|      2|
+----+----------+------+-------+

We can use the following syntax to filter the DataFrame to only show rows where the value in the team column is not equal to A, D, or E:

#define array of values
my_array = ['A', 'D', 'E']

#filter DataFrame to only contain rows where 'team' is not in my_array
df.filter(~df.team.isin(my_array)).show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

The resulting DataFrame only contains rows where the value in the team column is not equal to A, D, or E.

Note: The tilde ( ~ ) operator is used in PySpark to represent NOT.

By using this operator along with the isin function, we are able to filter the DataFrame to only contain rows where the value in a particular column is not in a list of values.

x