How can I filter a PySpark DataFrame by a Boolean column?


You can use the following methods to filter the rows of a PySpark DataFrame based on values in a Boolean column:

Method 1: Filter Based on Values in One Boolean Column

#filter for rows where value in 'all_star' column is True
df.filter(df.all_star==True).show()

Method 2: Filter Based on Values in Multiple Boolean Columns

#filter for rows where values in 'all_star' and 'starter' columns are both True
df.filter((df.all_star==True) & (df.starter==True)).show()

The examples below show how to use each method in practice with this PySpark DataFrame, which contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 18, True, False], 
        ['B', 20, False, True], 
        ['C', 25, True, True], 
        ['D', 40, True, True], 
        ['E', 34, True, False], 
        ['F', 32, False, False],
        ['G', 19, False, False]] 
  
#define column names
columns = ['team', 'points', 'all_star', 'starter'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+------+--------+-------+
|team|points|all_star|starter|
+----+------+--------+-------+
|   A|    18|    true|  false|
|   B|    20|   false|   true|
|   C|    25|    true|   true|
|   D|    40|    true|   true|
|   E|    34|    true|  false|
|   F|    32|   false|  false|
|   G|    19|   false|  false|
+----+------+--------+-------+

Example 1: Filter Based on Values in One Boolean Column

We can use the following syntax to filter the DataFrame to only contain rows where the value in the all_star column is true:

#filter for rows where value in 'all_star' column is True
df.filter(df.all_star==True).show()

+----+------+--------+-------+
|team|points|all_star|starter|
+----+------+--------+-------+
|   A|    18|    true|  false|
|   C|    25|    true|   true|
|   D|    40|    true|   true|
|   E|    34|    true|  false|
+----+------+--------+-------+

Notice that each of the rows in the filtered DataFrame has a value of true in the all_star column.

Example 2: Filter Based on Values in Multiple Boolean Columns

We can use the following syntax to filter the DataFrame to only contain rows where the value in the all_star column is true and the value in the starter column is true:

#filter for rows where values in 'all_star' and 'starter' columns are both True
df.filter((df.all_star==True) & (df.starter==True)).show()

+----+------+--------+-------+
|team|points|all_star|starter|
+----+------+--------+-------+
|   C|    25|    true|   true|
|   D|    40|    true|   true|
+----+------+--------+-------+

Notice that each of the rows in the filtered DataFrame has a value of true in both the all_star and starter columns.
