How can I use the “AND” operator in PySpark? Can you provide some examples?

The “AND” operator in PySpark logically combines two or more conditions so that only rows meeting every condition are returned. It is most commonly used when filtering data and when building compound conditional statements. For example, you could filter a dataset to only include rows where both “age is greater than 18” and “gender equals female”. You can also nest conditions, such as keeping rows where “income is greater than 50,000” and either “education level is ‘Bachelor’s degree’” or “occupation is ‘Manager’”. In short, the “AND” operator is a simple but powerful tool for refining PySpark queries to meet specific data analysis needs.

Use “AND” Operator in PySpark (With Examples)


There are two common ways to filter a PySpark DataFrame by using an “AND” operator:

Method 1: Use “AND”

#filter DataFrame where points is greater than 5 and conference equals "East"
df.filter('points>5 and conference=="East"').show()

Method 2: Use & Symbol

#filter DataFrame where points is greater than 5 and conference equals "East"
df.filter((df.points>5) & (df.conference=="East")).show()

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Filter DataFrame Using “AND”

We can use the following syntax with the filter function and the word and to filter the DataFrame to only contain rows where the value in the points column is greater than 5 and the value in the conference column is equal to East:

#filter DataFrame where points is greater than 5 and conference equals "East"
df.filter('points>5 and conference=="East"').show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
+----+----------+------+-------+

Notice that each of the rows in the resulting DataFrame meets both of the following conditions:

  • The value in the points column is greater than 5
  • The value in the conference column is equal to “East”

Also note that in this example we only used one and operator, but you can chain as many and operators as you’d like inside the filter function to filter on even more conditions.

Example 2: Filter DataFrame Using & Symbol

We can use the following syntax with the filter function and the & symbol to filter the DataFrame to only contain rows where the value in the points column is greater than 5 and the value in the conference column is equal to East:

#filter DataFrame where points is greater than 5 and conference equals "East"
df.filter((df.points>5) & (df.conference=="East")).show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
+----+----------+------+-------+

Notice that each of the rows in the resulting DataFrame meets both of the following conditions:

  • The value in the points column is greater than 5
  • The value in the conference column is equal to “East”

Also note that this DataFrame matches the DataFrame from the previous example.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Use “OR” Operator
