How to Use the “AND” Operator in PySpark (With Examples)


There are two common ways to filter a PySpark DataFrame by using an “AND” operator:

Method 1: Use “AND”

#filter DataFrame where points is greater than 5 and conference equals "East"
df.filter('points>5 and conference=="East"').show()

Method 2: Use & Symbol

#filter DataFrame where points is greater than 5 and conference equals "East"
df.filter((df.points>5) & (df.conference=="East")).show()

The examples below show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Filter DataFrame Using “AND”

We can pass a SQL expression string containing the keyword and to the filter function to keep only the rows where the value in the points column is greater than 5 and the value in the conference column is equal to East:

#filter DataFrame where points is greater than 5 and conference equals "East"
df.filter('points>5 and conference=="East"').show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
+----+----------+------+-------+

Notice that each of the rows in the resulting DataFrame meets both of the following conditions:

  • The value in the points column is greater than 5
  • The value in the conference column is equal to “East”

Also note that this example uses only one and operator, but you can chain as many and operators as you’d like inside the filter function to filter on additional conditions.

Example 2: Filter DataFrame Using & Symbol

We can use the following syntax with the filter function and the & symbol to filter the DataFrame to only contain rows where the value in the points column is greater than 5 and the value in the conference column is equal to East:

#filter DataFrame where points is greater than 5 and conference equals "East"
df.filter((df.points>5) & (df.conference=="East")).show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
+----+----------+------+-------+

Notice that each of the rows in the resulting DataFrame meets both of the following conditions:

  • The value in the points column is greater than 5
  • The value in the conference column is equal to “East”

Also note that this DataFrame matches the DataFrame from the previous example.
