How can I use the “OR” operator in PySpark with examples?

The “OR” operator in PySpark is a logical operator that combines two or more conditions into one. For each row, the combined condition evaluates to true if at least one of the conditions is satisfied and to false if none of them are. It is most often used in filters and in conditional expressions that build new columns.

For example, in PySpark, we can use the “OR” operator to filter a dataframe based on multiple conditions, such as selecting all rows where the value in column A is “True” or the value in column B is “True”. This can be written as:

df.filter((df['A'] == True) | (df['B'] == True))

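Another example is using Python's own or keyword in a conditional statement to perform different actions based on whether one or more conditions are met. Note that or only works on ordinary Python boolean values; applying it to DataFrame columns raises an error, so column conditions must be combined with | instead. With plain booleans, this can be written as:

if condition1 or condition2:
    # do something
    pass
else:
    # do something else
    pass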

In summary, an “OR” condition in PySpark can be written either with the word or inside a SQL expression string or with the | symbol between column conditions, and both forms keep the rows that satisfy at least one condition.

Use “OR” Operator in PySpark (With Examples)


There are two common ways to filter a PySpark DataFrame by using an “OR” operator:

Method 1: Use “OR”

#filter DataFrame where points is greater than 9 or team equals "B"
df.filter('points>9 or team=="B"').show()

Method 2: Use | Symbol

#filter DataFrame where points is greater than 9 or team equals "B"
df.filter((df.points>9) | (df.team=="B")).show()

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Filter DataFrame Using “OR”

We can use the following syntax with the filter function and the word or to filter the DataFrame to only contain rows where the value in the points column is greater than 9 or the value in the team column is equal to B:

#filter DataFrame where points is greater than 9 or team equals "B"
df.filter('points>9 or team=="B"').show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
+----+----------+------+-------+

Notice that each of the rows in the resulting DataFrame meets at least one of the following conditions:

  • The value in the points column is greater than 9
  • The value in the team column is equal to “B”

Also note that in this example we only used one or operator but you can combine as many or operators as you’d like inside the filter function to filter using even more conditions.

Example 2: Filter DataFrame Using | Symbol

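We can use the following syntax with the filter function and the | symbol to filter the DataFrame to only contain rows where the value in the points column is greater than 9 or the value in the team column is equal to B. Each condition must be wrapped in its own parentheses because | binds more tightly than comparison operators such as > and == in Python: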

#filter DataFrame where points is greater than 9 or team equals "B"
df.filter((df.points>9) | (df.team=="B")).show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
+----+----------+------+-------+

Notice that each of the rows in the resulting DataFrame meets at least one of the following conditions:

  • The value in the points column is greater than 9
  • The value in the team column is equal to “B”

Also note that this DataFrame matches the DataFrame from the previous example.
