How to Drop Rows that Contain a Specific Value in PySpark

Dropping rows that contain a specific value in PySpark is done with the filter() function, which takes a Boolean column expression and keeps only the rows where that expression evaluates to True. To drop rows that contain a specific value, negate the condition (for example, with != or ~) so that only the rows without that value are kept.

You can use the following methods to drop rows in a PySpark DataFrame that contain a specific value:

Method 1: Drop Rows with Specific Value

#drop rows where value in 'conference' column is equal to 'West'
df_new = df.filter(df.conference != 'West')

Method 2: Drop Rows with One of Several Specific Values

from pyspark.sql.functions import col

#drop rows where value in 'team' column is equal to 'A' or 'D'
df_new = df.filter(~col('team').isin(['A','D']))
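
Note that filter() also accepts a SQL expression string, and where() is an alias for filter() that behaves identically. As a minimal equivalent sketch, the two methods above can also be written as:

#equivalent filters written as SQL expression strings
df_new = df.filter("conference != 'West'")
df_new = df.filter("team NOT IN ('A', 'D')")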

The following examples show how to use each method in practice with a PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5],
        ['C', 'East', 15],
        ['C', 'West', 31],
        ['D', 'West', 24]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
|   C|      East|    15|
|   C|      West|    31|
|   D|      West|    24|
+----+----------+------+

Example 1: Drop Rows with Specific Value in PySpark

We can use the following syntax to drop rows that contain the value ‘West’ in the conference column of the DataFrame:

#drop rows where value in 'conference' column is equal to 'West'
df_new = df.filter(df.conference != 'West')

#view new DataFrame
df_new.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   C|      East|     5|
|   C|      East|    15|
+----+----------+------+

Notice that all rows in the DataFrame that contained the value ‘West’ in the conference column have been dropped.
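
One caveat: if the conference column contained null values, the comparison df.conference != 'West' would evaluate to null for those rows, and filter() drops rows where the condition is null. Assuming you wanted to keep such rows (this DataFrame happens to have no nulls), you could include them explicitly. A minimal sketch:

from pyspark.sql.functions import col

#keep rows where conference is not 'West', plus rows where conference is null
df_new = df.filter((col('conference') != 'West') | col('conference').isNull())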

Example 2: Drop Rows with One of Several Specific Values in PySpark

We can use the following syntax to drop rows that contain the value ‘A’ or ‘D’ in the team column of the DataFrame:

from pyspark.sql.functions import col

#drop rows where value in 'team' column is equal to 'A' or 'D'
df_new = df.filter(~col('team').isin(['A','D']))

#view new DataFrame
df_new.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
|   C|      East|    15|
|   C|      West|    31|
+----+----------+------+

Notice that all rows in the DataFrame that contained the value ‘A’ or ‘D’ in the team column have been dropped.
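
Note that isin() also accepts the values as separate arguments, so the list brackets are optional. As a quick sanity check, you can also compare row counts before and after filtering; a short sketch:

#isin() accepts values as separate arguments
df_new = df.filter(~col('team').isin('A', 'D'))

#verify how many rows were dropped (9 original rows - 5 remaining rows)
print(df.count() - df_new.count())  #4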

Note: You can find the complete documentation for the PySpark filter function in the official PySpark documentation.
