How can I drop rows in PySpark based on multiple conditions?

In PySpark, you can drop rows from a DataFrame based on multiple conditions by using the .filter() function (or its alias, .where()). These functions keep the rows that match a condition, so to drop rows you negate the combined condition with the ~ operator. Multiple conditions are combined with the logical operators & (and), | (or), and ~ (not), giving you flexible control over which rows are removed.
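As a minimal sketch (assuming a DataFrame df with team and points columns), the following two calls are equivalent, since .where() is an alias for .filter():

import pyspark.sql.functions as F

#.filter() and .where() are aliases, so both drop the same rows
dropped1 = df.filter(~((F.col('team') == 'A') & (F.col('points') > 10)))
dropped2 = df.where(~((F.col('team') == 'A') & (F.col('points') > 10)))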

PySpark: Drop Rows Based on Multiple Conditions


You can use the following syntax to drop rows from a PySpark DataFrame based on multiple conditions:

import pyspark.sql.functions as F

#drop rows where team is 'A' and points > 10
df_new = df.filter(~((F.col('team') == 'A') & (F.col('points') > 10)))

This particular example drops all rows from the DataFrame where the value in the team column is ‘A’ and the value in the points column is greater than 10. The ~ operator negates the combined condition, so .filter() keeps only the rows that do not match it.

The following example shows how to use this syntax in practice.

Example: Drop Rows Based on Multiple Conditions in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

We can use the following syntax to drop all rows from the DataFrame where the value in the team column is ‘A’ and the value in the points column is greater than 10:

import pyspark.sql.functions as F

#drop rows where team is 'A' and points > 10
df_new = df.filter(~((F.col('team') == 'A') & (F.col('points') > 10)))

#view new DataFrame
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|     8|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

Notice that all three rows in the DataFrame where the value in the team column is ‘A’ and the value in the points column is greater than 10 have been dropped.
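To confirm how many rows were dropped, you can compare the row counts of the original and filtered DataFrames (a quick sanity check, not required by the syntax itself):

#count how many rows were dropped
print(df.count() - df_new.count())

3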

Note that a row must meet both of these conditions to be dropped from the DataFrame.
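Conversely, if you want to drop rows that meet either condition, you can swap the & operator for |. A minimal sketch using the same columns:

import pyspark.sql.functions as F

#drop rows where team is 'A' or points > 10
df_or = df.filter(~((F.col('team') == 'A') | (F.col('points') > 10)))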

Note #1: We used a single & symbol to filter based on two conditions, but you can chain additional & symbols to filter by even more conditions, as sketched below.
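For instance, here is a sketch that only drops a row when all three conditions hold, using the position column from the example DataFrame:

import pyspark.sql.functions as F

#drop rows where team is 'A', position is 'Guard', and points > 10
df_new = df.filter(~((F.col('team') == 'A') &
                     (F.col('position') == 'Guard') &
                     (F.col('points') > 10)))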

Note #2: You can find the complete documentation for the PySpark filter function in the official PySpark API reference.
