In PySpark, you can filter rows based on the values in a list by using the isin method of a column. This method takes a list of values and returns a boolean column that indicates whether each value in the original column appears in the list. Passing that boolean column to filter keeps only the rows whose values are in the list, which is useful for selecting a subset of data from a predefined set of values.

You can use the following syntax to filter a PySpark DataFrame for rows that contain a value from a specific list:

#specify values to filter for
my_list = ['Mavs', 'Kings', 'Spurs']

#filter for rows where team is in list
df.filter(df.team.isin(my_list)).show()

This particular example filters the DataFrame to only contain rows where the value in the team column is equal to one of the values in the list that we specified.

The following example shows how to use this syntax in practice.

Example: How to Filter Rows Based on Values in List in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Mavs', 15], 
        ['Kings', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Nets', 40],
        ['Mavs', 24],
        ['Spurs', 13]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|   Mavs|    15|
|  Kings|    19|
|Wizards|    24|
|  Magic|    28|
|   Nets|    40|
|   Mavs|    24|
|  Spurs|    13|
+-------+------+

We can use the following syntax to filter the DataFrame for rows where the team column is equal to a team name in a specific list:

#specify values to filter for
my_list = ['Mavs', 'Kings', 'Spurs']

#filter for rows where team is in list
df.filter(df.team.isin(my_list)).show()

+-----+------+
| team|points|
+-----+------+
| Mavs|    18|
| Mavs|    15|
|Kings|    19|
| Mavs|    24|
|Spurs|    13|
+-----+------+

Notice that each row in the filtered DataFrame has a team value equal to Mavs, Kings or Spurs, which are the three team names that we specified in our list.

Note #1: The isin function is case-sensitive.

Note #2: You can find the complete documentation for the PySpark isin function in the official PySpark API reference.
