What is the process for calculating the percentage of total in PySpark using groupBy?

Calculating a percentage of total in PySpark with groupBy involves four steps: group the data by a category, compute the total for each group, divide each group's total by the overall total, and multiply by 100. The result can then be displayed or saved as a new column in the grouped data.


You can use the following syntax to calculate the percentage of total rows that each group represents in a PySpark DataFrame:

import pyspark.sql.functions as F

#calculate total rows in DataFrame
n = df.count()

#calculate percent of total rows for each team
df.groupBy('team').count().withColumn('team_percent', (F.col('count')/n)*100).show()

This particular example counts the number of occurrences for each unique value in the team column and then calculates the percentage of total rows that each unique value represents.

The following example shows how to use this syntax in practice.

Example: Calculate Percentage of Total with groupBy in PySpark

Suppose we have the following PySpark DataFrame that contains information about the points scored by various basketball players:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['C', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   C| Forward|     7|
+----+--------+------+

We can use the following syntax to count the number of occurrences of each unique value in the team column and then calculate the percentage of total rows that each unique team value represents:

#calculate total rows in DataFrame
n = df.count()

#calculate percent of total rows for each team
df.groupBy('team').count().withColumn('team_percent', (F.col('count')/n)*100).show()

+----+-----+------------+
|team|count|team_percent|
+----+-----+------------+
|   A|    4|        50.0|
|   B|    3|        37.5|
|   C|    1|        12.5|
+----+-----+------------+

The team_percent column shows the percentage of total rows represented by each unique team.

For example, there are 8 total rows in the DataFrame.

From the team_percent column, we can see:

  • There are 4 occurrences of team A, which represents 4/8 = 50% of the total rows.
  • There are 3 occurrences of team B, which represents 3/8 = 37.5% of the total rows.
  • There is 1 occurrence of team C, which represents 1/8 = 12.5% of the total rows.

Note: You can find the complete documentation for the PySpark groupBy function in the official PySpark docs.
