How can I use “IS NOT IN” in PySpark, with an example?

How to Filter Data Using “IS NOT IN” in PySpark

An Introduction to Advanced Data Filtering in PySpark

In the expansive landscape of Big Data processing, PySpark stands out as a premier tool for managing and analyzing massive datasets with high efficiency. One of the most fundamental tasks any data engineer or scientist performs is filtering data to narrow down results to a specific subset of interest. While selecting data that matches certain criteria is straightforward, there are many scenarios where you need to do the opposite: exclude data that matches a specific list of values. This is where the concept of the “IS NOT IN” logical operation becomes essential for cleaning, transforming, and preparing data for high-level data analysis.

The “IS NOT IN” operation in Apache Spark is not a single keyword but rather a combination of functional methods and logical operators. By leveraging the DataFrame.filter() method in conjunction with the isin() function and the bitwise NOT operator, users can precisely control which rows are omitted from their resulting DataFrame. This capability is particularly useful when dealing with categorical variables where a few specific categories represent noise, outliers, or irrelevant information that could skew the results of a machine learning model or a statistical report.

Understanding how to implement this logic is vital for maintaining clean data pipelines. Whether you are filtering out blacklisted IP addresses in a cybersecurity log, removing canceled orders from a sales report, or excluding specific experimental groups in a scientific study, mastering the “IS NOT IN” syntax ensures your Python code remains readable, performant, and accurate. In the following sections, we will delve into the technical mechanics of this operation, explore the underlying Boolean algebra, and provide a comprehensive practical example to solidify your understanding of these core concepts.

The Mechanics of the isin Method and the Tilde Operator

To effectively perform an “IS NOT IN” operation, one must first understand the isin() method provided by the PySpark Column API. The isin() function evaluates each row in a specified column and returns a Boolean value: true if the value exists within the provided list, and false otherwise. While this is perfect for inclusion, the requirement for exclusion necessitates the use of a negation operator. In PySpark, negation is handled by the tilde (~) symbol, which acts as the logical NOT operator. When you prefix a condition with the tilde, you effectively flip the Boolean results, turning every true into a false and vice versa.

The synergy between these two components allows for a highly flexible filtering mechanism. By applying the tilde to the result of an isin() check, you instruct the Spark Optimizer to retain only those rows where the column value is absent from the specified array. This approach is superior to chaining multiple “not equal to” (!=) conditions, especially as the list of excluded values grows. Chaining multiple conditions manually can lead to verbose and error-prone code, whereas isin() accepts a standard Python list, making the logic much easier to maintain and update as project requirements evolve.

From a performance perspective, using isin() is highly efficient within the distributed computing environment of Spark. Because Spark utilizes lazy evaluation, the filter operation is not executed immediately. Instead, it is added to the logical plan, which the Catalyst Optimizer then refines to ensure the data is filtered as early as possible in the execution process. This minimizes the amount of data shuffled across the network, leading to faster execution times and lower resource consumption on your cluster.

Syntax and Implementation of Exclusion Logic

The standard syntax for implementing exclusion logic in a PySpark script is concise and follows a functional programming paradigm. To begin, you must have a SparkSession active and a DataFrame loaded into memory. The following syntax template demonstrates the most common way to filter out rows based on a list of prohibited values:

#define array of values
my_array = ['A', 'D', 'E']

#filter DataFrame to only contain rows where 'team' is not in my_array
df.filter(~df.team.isin(my_array)).show()

In this snippet, we first define a list containing the values we wish to exclude. The df.filter() method is then called, taking the negated isin() condition as its argument. It is important to note that the column can be referenced using either the dot notation (df.team) or the bracket notation (df[‘team’]). Both are valid, though dot notation is often preferred for its brevity. The tilde (~) must be placed directly before the column expression to negate the entire isin() operation.

This specific implementation will result in a new DataFrame that contains only the records where the team column does not contain ‘A’, ‘D’, or ‘E’. This logic is vital for data cleansing pipelines where you might receive a master dataset but only need to process specific subsets. By using this clean syntax, you ensure that other developers reading your source code can immediately identify the filtering criteria and the intention behind the data exclusion.

Practical Example: Preparing the PySpark Environment

To see the “IS NOT IN” logic in action, we must first construct a representative dataset. In this example, we will simulate a dataset representing basketball statistics, including team names, regional conferences, points scored, and assists recorded. This type of structured data is common in Business Intelligence applications. We begin by initializing a SparkSession, which serves as the entry point for all Spark functionality.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2],
        ['D', 'East', 14, 2],
        ['E', 'West', 25, 2]]
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
|   D|      East|    14|      2|
|   E|      West|    25|      2|
+----+----------+------+-------+

The code above utilizes the createDataFrame method to transform a raw Python list of lists into a distributed Spark DataFrame. This process involves defining the schema via a list of strings that represent the column headers. Once the DataFrame is created, the show() method displays the data in a tabular format, allowing us to verify the initial state of our information before any filtering takes place.

At this stage, the dataset contains eight rows across five different teams (‘A’ through ‘E’). In many real-world scenarios, a dataset like this could contain millions of rows, making manual inspection impossible. Therefore, relying on programmatic filtering methods is the only viable way to perform accurate data wrangling. Our goal in the next step will be to remove all entries associated with teams A, D, and E to focus our analysis on teams B and C.

Executing the Filter Operation and Analyzing Results

Now that our environment is set up and the data is loaded, we can apply the negation filter. We will define an exclusion array and use the filter method to strip away the unwanted rows. This operation effectively demonstrates how the ~ operator and isin() work together to produce a refined dataset based on exclusion criteria rather than inclusion.

#define array of values
my_array = ['A', 'D', 'E']

#filter DataFrame to only contain rows where 'team' is not in my_array
df.filter(~df.team.isin(my_array)).show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

The output clearly shows that all records belonging to team ‘A’, team ‘D’, and team ‘E’ have been successfully removed from the view. The resulting DataFrame contains only the rows for teams ‘B’ and ‘C’. This is a powerful demonstration of how a single line of code can perform complex data filtering tasks that would otherwise require significant manual effort or complex SQL queries.

It is important to remember that the filter() method does not modify the original DataFrame. Due to the immutability of DataFrames in Spark, the operation returns a new DataFrame. If you need to use this filtered data in subsequent steps of your pipeline, you should assign the result to a new variable name, such as filtered_df. This immutable nature of Spark objects is a key feature that helps prevent side effects and makes debugging complex data transformations much more manageable.

Comparing DataFrame API vs. Spark SQL for Filtering

While the DataFrame API is highly popular among Python developers, PySpark also allows users to perform filtering using Spark SQL. In Spark SQL, the “IS NOT IN” logic is even more explicit and mirrors standard relational database syntax. For users coming from a heavy SQL background, using the NOT IN clause within a spark.sql() call might feel more intuitive than the DataFrame API’s functional approach.

For example, one could achieve the same result by registering the DataFrame as a temporary view and then executing a SQL query: SELECT * FROM basketball_table WHERE team NOT IN (‘A’, ‘D’, ‘E’). Both methods are functionally equivalent and are handled by the same Catalyst Optimizer, meaning there is usually no performance penalty for choosing one over the other. The choice between the DataFrame API and Spark SQL often comes down to personal preference or the specific coding standards of a development team.

However, the DataFrame API offers better type safety and integration with Python’s native features. For instance, passing a Python list directly into the isin() method is much simpler than dynamically constructing a SQL string with a variable number of parameters. When building dynamic data pipelines where the exclusion list is generated at runtime, the DataFrame API typically provides a cleaner and more robust implementation compared to string manipulation in SQL.

Optimizing Filter Operations for Scalability

As your datasets grow from thousands to billions of rows, the way you implement filters can impact the overall performance of your Spark application. One key consideration when using isin() with a very large list of values is the potential for computation overhead. If the exclusion list contains thousands of items, it might be more efficient to store that list in a separate DataFrame and perform a left_anti join. An anti-join returns only the rows from the left table that do not have a match in the right table, which is functionally identical to an “IS NOT IN” filter.

Another optimization technique involves partitioning. If you frequently filter by a specific column, such as “team” or “date,” ensuring that your data is partitioned by that column can significantly speed up filter operations. Spark can use partition pruning to skip entire directories of data that it knows do not meet the filter criteria, drastically reducing I/O overhead. This is particularly effective when reading data from formats like Apache Parquet or ORC, which store metadata about the values contained within each file.

Lastly, always consider the placement of your filters. In a complex Spark job with multiple joins and aggregations, the “IS NOT IN” filter should be applied as early as possible. This “predicate pushdown” ensures that the volume of data being processed in subsequent stages is minimized. While the Spark optimizer handles much of this automatically, being mindful of your filter placement can help you write more efficient code and avoid common pitfalls associated with data shuffling and resource exhaustion.

Best Practices and Common Pitfalls

When implementing the “IS NOT IN” logic in PySpark, there are several best practices to keep in mind to ensure your code is both robust and performant. First, always be cautious with NULL values. In many SQL environments, if a column contains a null, the result of a NOT IN operation can be unexpected. In PySpark, isin() typically handles nulls by returning null (which is treated as false in a filter), but it is often safer to explicitly handle nulls using the isNull() or dropna() methods before applying complex filters.

Additionally, maintain the readability of your code by giving descriptive names to your exclusion lists. Instead of naming a variable my_array, use a name like excluded_teams_list or blacklisted_user_ids. This provides context to anyone reviewing the code and makes the business logic behind the filter immediately apparent. Furthermore, avoid linking or hardcoding large lists directly in your script; instead, consider loading these values from a configuration file or a database table to keep your code flexible.

Finally, always verify the results of your filter with a count or a sample. Using df.count() before and after the filter is a quick way to ensure the operation removed the expected number of rows. By following these guidelines and mastering the use of the tilde operator and the isin() method, you will be well-equipped to handle even the most challenging data filtering tasks in the PySpark ecosystem. This knowledge forms the foundation of effective ETL (Extract, Transform, Load) processes and high-quality data engineering.

Cite this article

stats writer (2026). How to Filter Data Using “IS NOT IN” in PySpark. PSYCHOLOGICAL SCALES. Retrieved from https://scales.arabpsychology.com/stats/how-can-i-use-is-not-in-in-pyspark-with-an-example/

stats writer. "How to Filter Data Using “IS NOT IN” in PySpark." PSYCHOLOGICAL SCALES, 11 Feb. 2026, https://scales.arabpsychology.com/stats/how-can-i-use-is-not-in-in-pyspark-with-an-example/.

stats writer. "How to Filter Data Using “IS NOT IN” in PySpark." PSYCHOLOGICAL SCALES, 2026. https://scales.arabpsychology.com/stats/how-can-i-use-is-not-in-in-pyspark-with-an-example/.

stats writer (2026) 'How to Filter Data Using “IS NOT IN” in PySpark', PSYCHOLOGICAL SCALES. Available at: https://scales.arabpsychology.com/stats/how-can-i-use-is-not-in-in-pyspark-with-an-example/.

[1] stats writer, "How to Filter Data Using “IS NOT IN” in PySpark," PSYCHOLOGICAL SCALES, vol. X, no. Y, ص Z-Z, February, 2026.

stats writer. How to Filter Data Using “IS NOT IN” in PySpark. PSYCHOLOGICAL SCALES. 2026;vol(issue):pages.

Download Post (.PDF)
Slide Up
x
PDF
Scroll to Top