How can I select rows in PySpark based on column values?

PySpark is a powerful tool for data manipulation and analysis. One of its most common operations is selecting specific rows from a DataFrame based on column values. This is done with the filter() function (or its alias, where()), which takes a condition and returns a new DataFrame containing only the rows that meet it. Because Spark evaluates the condition in a distributed fashion, this kind of targeted selection remains efficient even on very large datasets.

PySpark: Select Rows Based on Column Values


You can use the following methods to select rows based on column values in a PySpark DataFrame:

Method 1: Select Rows where Column is Equal to Specific Value

#select rows where 'team' column is equal to 'B'
df.where(df.team == 'B').show()

Method 2: Select Rows where Column Value is in List of Values

#select rows where 'team' column is equal to 'A' or 'B'
df.filter(df.team.isin('A', 'B')).show()

Method 3: Select Rows Based on Multiple Column Conditions

#select rows where 'team' column is 'A' and 'points' column is greater than 9
df.where((df.team == 'A') & (df.points > 9)).show()
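
Note that where() is simply an alias for filter() in PySpark, so the two methods can be used interchangeably. You can also reference a column with the col() function instead of attribute notation, which is handy when a column name contains spaces or clashes with a DataFrame attribute. For example, the following three statements are equivalent:

from pyspark.sql.functions import col

#these three statements all select rows where 'team' is equal to 'B'
df.filter(df.team == 'B').show()
df.where(df.team == 'B').show()
df.where(col('team') == 'B').show()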

The following examples show how to use each of these methods in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create DataFrame using data and column names
df = spark.createDataFrame(data, columns) 
  
#view DataFrame
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+
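
Before filtering, it can help to verify the column types that Spark inferred from the data, since a comparison like points > 9 assumes a numeric column:

#view the schema of the DataFrame
df.printSchema()

root
 |-- team: string (nullable = true)
 |-- conference: string (nullable = true)
 |-- points: long (nullable = true)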

Example 1: Select Rows where Column is Equal to Specific Value

We can use the following syntax to select only the rows where the team column is equal to B:

#select rows where 'team' column is equal to 'B'
df.where(df.team == 'B').show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   B|      West|     6|
|   B|      West|     6|
+----+----------+------+

Notice that only the rows where the team column is equal to B are returned.
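
You can also invert the condition with the != operator to select every row where the team column is not equal to B:

#select rows where 'team' column is not equal to 'B'
df.where(df.team != 'B').show()

For this DataFrame, that would return the three rows for team A and the one row for team C.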

Example 2: Select Rows where Column Value is in List of Values

We can use the following syntax to select only the rows where the team column is equal to A or B:

#select rows where 'team' column is equal to 'A' or 'B'
df.filter(df.team.isin('A', 'B')).show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
+----+----------+------+

Notice that only the rows where the team column is equal to either A or B are returned.
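
To select the rows whose team value is not in the list, negate the isin() condition with the ~ operator:

#select rows where 'team' column is not equal to 'A' or 'B'
df.filter(~df.team.isin('A', 'B')).show()

For this DataFrame, that would return only the single row for team C.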

Example 3: Select Rows Based on Multiple Column Conditions

We can use the following syntax to select only the rows where the team column is equal to A and the points column is greater than 9:

#select rows where 'team' column is 'A' and 'points' column is greater than 9
df.where((df.team == 'A') & (df.points > 9)).show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|    10|
+----+----------+------+

Notice that only the two rows that met both conditions are returned.
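
The & operator requires both conditions to be true. To select rows where at least one condition holds, use the | operator instead, and note that each condition must be wrapped in its own parentheses because & and | bind more tightly than comparison operators in Python:

#select rows where 'team' column is 'C' or 'points' column is greater than 9
df.where((df.team == 'C') | (df.points > 9)).show()

For this DataFrame, that would return the two team A rows with more than 9 points along with the team C row.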
