How can I count the number of occurrences of a specific value in a PySpark dataframe?

To count the occurrences of a specific value in a PySpark DataFrame, filter the DataFrame on that value with the “filter” method and then call “count”, which returns the number of rows that matched. To get the frequency of every distinct value in a column at once, group the DataFrame by that column with “groupBy” and call “count” on the grouped data, which returns one row per distinct value along with its count.

Count Number of Occurrences in PySpark


You can use the following methods to count the number of occurrences of values in a PySpark DataFrame:

Method 1: Count Number of Occurrences of Specific Value in Column

df.filter(df.my_column=='specific_value').count()

Method 2: Count Number of Occurrences of Each Value in Column

df.groupBy('my_column').count().show() 

The following examples show how to use each method in practice with a PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Guard', 13],
        ['B', 'Forward', 7],
        ['C', 'Guard', 8],
        ['C', 'Forward', 5]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B|   Guard|    13|
|   B| Forward|     7|
|   C|   Guard|     8|
|   C| Forward|     5|
+----+--------+------+

Example 1: Count Number of Occurrences of Specific Value in Column

We can use the following syntax to count the number of occurrences of ‘Forward’ in the position column of the DataFrame:

#count number of occurrences of 'Forward' in position column
df.filter(df.position=='Forward').count()

4

From the output we can see that ‘Forward’ occurs a total of 4 times in the position column.

Example 2: Count Number of Occurrences of Each Value in Column

We can use the following syntax to count the number of occurrences of each unique value in the team column of the DataFrame:

#count number of occurrences of each unique value in team column
df.groupBy('team').count().show() 

+----+-----+
|team|count|
+----+-----+
|   A|    4|
|   B|    4|
|   C|    2|
+----+-----+

From the output we can see:

  • The value ‘A’ occurs 4 times in the team column.
  • The value ‘B’ occurs 4 times in the team column.
  • The value ‘C’ occurs 2 times in the team column.
