PySpark: Use Alias After Groupby Count


You can use the following syntax to rename the “count” column that is produced by a groupBy count in a PySpark DataFrame:

df.groupBy('team').count().withColumnRenamed('count', 'row_count').show()

This particular example counts the number of rows for each unique value in the team column.

Then we use the withColumnRenamed function to rename the “count” column to “row_count” in the resulting DataFrame.

The following example shows how to use this syntax in practice.

Example: How to Use Alias After Groupby Count in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Guard', 13],
        ['B', 'Forward', 7],
        ['C', 'Guard', 8],
        ['C', 'Forward', 5]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B|   Guard|    13|
|   B| Forward|     7|
|   C|   Guard|     8|
|   C| Forward|     5|
+----+--------+------+

We can use the following syntax to count the number of rows in the DataFrame grouped by the values in the team column:

#count number of rows by team
df.groupBy('team').count().show()

+----+-----+
|team|count|
+----+-----+
|   A|    4|
|   B|    4|
|   C|    2|
+----+-----+

By default, the count function simply uses “count” as the column name in the resulting DataFrame.

However, we can use the following syntax to name this column row_count instead:

#count number of rows by team and rename 'count' column to 'row_count'
df.groupBy('team').count().withColumnRenamed('count', 'row_count').show()

+----+---------+
|team|row_count|
+----+---------+
|   A|        4|
|   B|        4|
|   C|        2|
+----+---------+

The DataFrame now uses row_count as the column name, just as we specified.
