Count by Group in PySpark (With Examples)
Counting by group in PySpark means counting the number of rows within each group defined by one or more columns of a DataFrame. This is a quick way to summarize large datasets and spot patterns within groups. For example, you can count the number of sales made by each salesperson in a company, or the number of customers in each age group.
You can use the following methods to count values by group in a PySpark DataFrame:
Method 1: Count Values Grouped by One Column
df.groupBy('col1').count().show()
Method 2: Count Values Grouped by Multiple Columns
df.groupBy('col1', 'col2').count().show()
The following examples show how to use each method in practice with the following PySpark DataFrame that contains information about various basketball players:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Guard', 13],
        ['B', 'Forward', 7],
        ['C', 'Guard', 8],
        ['C', 'Forward', 5]]

#define column names
columns = ['team', 'position', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B|   Guard|    13|
|   B| Forward|     7|
|   C|   Guard|     8|
|   C| Forward|     5|
+----+--------+------+
Example 1: Count Values Grouped by One Column
We can use the following syntax to count the number of rows in the DataFrame grouped by the values in the team column:
#count number of values by team
df.groupBy('team').count().show()

+----+-----+
|team|count|
+----+-----+
|   A|    4|
|   B|    4|
|   C|    2|
+----+-----+
From the output we can see:
- There are 4 players on team A
- There are 4 players on team B
- There are 2 players on team C
Example 2: Count Values Grouped by Multiple Columns
We can use the following syntax to count the number of rows in the DataFrame grouped by the values in the team and position columns:
#count number of values by team and position
df.groupBy('team', 'position').count().show()

+----+--------+-----+
|team|position|count|
+----+--------+-----+
|   A|   Guard|    2|
|   A| Forward|    2|
|   B|   Guard|    3|
|   B| Forward|    1|
|   C| Forward|    1|
|   C|   Guard|    1|
+----+--------+-----+
From the output we can see:
- There are 2 players on team A with a position of Guard.
- There are 2 players on team A with a position of Forward.
- There are 3 players on team B with a position of Guard.