How can I count the values in a column in PySpark, while also applying a condition on the column’s values?

To count the values in a PySpark column that meet a condition, use the filter function to keep only the rows that satisfy the condition, then call count() on the result. Alternatively, you can use the agg function to apply the count aggregate to the filtered data, which produces the same result.

PySpark: Count Values in Column with Condition


You can use the following methods to count the number of values in a column of a PySpark DataFrame that meet a specific condition:

Method 1: Count Values that Meet One Condition

#count values in 'team' column that are equal to 'C'
df.filter(df.team == 'C').count()

Method 2: Count Values that Meet One of Several Conditions

from pyspark.sql.functions import col

#count values in 'team' column that are equal to 'A' or 'D'
df.filter(col('team').isin(['A','D'])).count()

The following examples show how to use each method in practice with the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5],
        ['C', 'East', 15],
        ['C', 'West', 31],
        ['D', 'West', 24]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
|   C|      East|    15|
|   C|      West|    31|
|   D|      West|    24|
+----+----------+------+

Example 1: Count Values that Meet One Condition

We can use the following syntax to count the number of values in the team column that are equal to C:

#count values in 'team' column that are equal to 'C'
df.filter(df.team == 'C').count()

3

We can see that a total of 3 values in the team column are equal to C.

Example 2: Count Values that Meet One of Several Conditions

We can use the following syntax to count the number of values in the team column that are equal to either A or D:

from pyspark.sql.functions import col

#count values in 'team' column that are equal to 'A' or 'D'
df.filter(col('team').isin(['A','D'])).count()

4

We can see that a total of 4 values in the team column are equal to either A or D.

Note: You can find the complete documentation for the PySpark filter function in the official PySpark API reference.
