How can I use groupBy in PySpark to count distinct values?

The groupBy function in PySpark groups the rows of a DataFrame by one or more columns so that you can run aggregate operations on each group. A common use case is counting the number of distinct values in a column within each group, which is done with the countDistinct aggregate function. Together, groupBy and countDistinct let you efficiently summarize how many unique values each group contains, which is especially useful during data preprocessing and exploratory data analysis.

PySpark: Use groupBy with Count Distinct


You can use the following syntax to count the number of distinct values in one column of a PySpark DataFrame, grouped by another column:

from pyspark.sql.functions import countDistinct

#count the number of distinct values in the points column for each team
df.groupBy('team').agg(countDistinct('points')).show()

This particular example calculates the number of distinct values in the points column, grouped by the values in the team column.
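Note that in Spark 3.2 and later, the same function is also available under the snake_case name count_distinct (countDistinct is kept as an alias), so the following is equivalent:

from pyspark.sql.functions import count_distinct

#count the number of distinct values in the points column for each team
df.groupBy('team').agg(count_distinct('points')).show()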

The following example shows how to use this syntax in practice.

Example: How to Use groupBy with Count Distinct in PySpark

Suppose we have the following PySpark DataFrame that contains information about the points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 14],
        ['C', 'Forward', 23],
        ['C', 'Guard', 30]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|    14|
|   C| Forward|    23|
|   C|   Guard|    30|
+----+--------+------+

We can use the following syntax to calculate the number of distinct values in the points column, grouped by the values in the team column:

from pyspark.sql.functions import countDistinct

#count the number of distinct values in the points column, grouped by team
df.groupBy('team').agg(countDistinct('points')).show()

+----+-------------+
|team|count(points)|
+----+-------------+
|   B|            2|
|   C|            2|
|   A|            3|
+----+-------------+

The resulting DataFrame shows the number of distinct values in the points column, grouped by the values in the team column.

For example, we can see:

  • There are 2 distinct values in the points column for team B.
  • There are 2 distinct values in the points column for team C.
  • There are 3 distinct values in the points column for team A.
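
If you prefer SQL syntax, the same aggregation can be written with COUNT(DISTINCT ...) against a temporary view; the view name players below is just an arbitrary choice:

#register the DataFrame as a temporary SQL view (the name 'players' is arbitrary)
df.createOrReplaceTempView('players')

#equivalent aggregation using COUNT(DISTINCT ...)
spark.sql('SELECT team, COUNT(DISTINCT points) AS distinct_points FROM players GROUP BY team').show()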

If you would like to give the count(points) column a different name, you can use the alias function as follows:

from pyspark.sql.functions import countDistinct

#count distinct values in the points column per team, renaming the result column
df.groupBy('team').agg(countDistinct('points').alias('distinct_points')).show()

+----+---------------+
|team|distinct_points|
+----+---------------+
|   B|              2|
|   C|              2|
|   A|              3|
+----+---------------+

The resulting DataFrame shows the number of distinct points values for each team, with the count column now named distinct_points, just as we specified with alias.
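
Note that countDistinct also accepts multiple columns, in which case it counts the distinct combinations of values across those columns within each group. For example, the following counts the distinct (position, points) pairs for each team (the alias distinct_pairs is just an arbitrary name):

from pyspark.sql.functions import countDistinct

#count distinct (position, points) combinations for each team
df.groupBy('team').agg(countDistinct('position', 'points').alias('distinct_pairs')).show()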

Note: You can find the complete documentation for the PySpark groupBy function in the official PySpark API reference.
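
For very large DataFrames, computing an exact distinct count for every group can be expensive. If an approximate answer is acceptable, the approx_count_distinct function (backed by the HyperLogLog++ algorithm) is usually much faster; by default it allows a relative standard deviation of about 5%:

from pyspark.sql.functions import approx_count_distinct

#approximate distinct count of points per team (default relative error ~5%)
df.groupBy('team').agg(approx_count_distinct('points').alias('approx_distinct_points')).show()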
