PySpark: Select Rows Based on Column Values


You can use the following methods to select rows based on column values in a PySpark DataFrame:

Method 1: Select Rows where Column is Equal to Specific Value

#select rows where 'team' column is equal to 'B'
df.where(df.team=='B').show() 

Method 2: Select Rows where Column Value is in List of Values

#select rows where 'team' column is equal to 'A' or 'B'
df.filter(df.team.isin('A','B')).show()

Method 3: Select Rows Based on Multiple Column Conditions

#select rows where 'team' column is 'A' and 'points' column is greater than 9
df.where((df.team=='A') & (df.points>9)).show() 

The following examples show how to use each of these methods in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create DataFrame using data and column names
df = spark.createDataFrame(data, columns) 
  
#view DataFrame
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+

Example 1: Select Rows where Column is Equal to Specific Value

We can use the following syntax to select only the rows where the team column is equal to B:

#select rows where 'team' column is equal to 'B'
df.where(df.team=='B').show() 

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   B|      West|     6|
|   B|      West|     6|
+----+----------+------+

Notice that only the rows where the team column is equal to B are returned.

Example 2: Select Rows where Column Value is in List of Values

We can use the following syntax to select only the rows where the team column is equal to A or B:

#select rows where 'team' column is equal to 'A' or 'B'
df.filter(df.team.isin('A','B')).show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
+----+----------+------+

Notice that only the rows where the team column is equal to either A or B are returned.

Example 3: Select Rows Based on Multiple Column Conditions

We can use the following syntax to select only the rows where the team column is equal to A and the points column is greater than 9:

#select rows where 'team' column is 'A' and 'points' column is greater than 9
df.where((df.team=='A') & (df.points>9)).show() 

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|    10|
+----+----------+------+

Notice that only the two rows that met both conditions are returned.

x