How can I use an alias after performing a groupby count in PySpark?

Using an alias after performing a groupby count in PySpark allows you to assign a custom name to the resulting count column, making it easier to refer to and manipulate in subsequent steps. You can do this either with the PySpark “alias()” function when aggregating, or by renaming the column afterwards with “withColumnRenamed()”. Giving the count column a descriptive name improves the readability and organization of your code when performing further operations on the grouped data.

PySpark: Use Alias After Groupby Count

You can use the following syntax to give the “count” column an alias after performing a groupBy count in a PySpark DataFrame:

df.groupBy('team').count().withColumnRenamed('count', 'row_count').show()

This particular example counts the number of rows in the DataFrame, grouped by the team column.

We then use the withColumnRenamed function to rename the “count” column to “row_count” in the resulting DataFrame.

The following example shows how to use this syntax in practice.

Example: How to Use Alias After Groupby Count in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Guard', 13],
        ['B', 'Forward', 7],
        ['C', 'Guard', 8],
        ['C', 'Forward', 5]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B|   Guard|    13|
|   B| Forward|     7|
|   C|   Guard|     8|
|   C| Forward|     5|
+----+--------+------+

We can use the following syntax to count the number of rows in the DataFrame grouped by the values in the team column:

#count number of rows by team
df.groupBy('team').count().show()

+----+-----+
|team|count|
+----+-----+
|   A|    4|
|   B|    4|
|   C|    2|
+----+-----+

By default, the count function simply uses “count” as the column name in the resulting DataFrame.

However, we could use the following syntax to instead use the name row_count as the column name in the resulting DataFrame:

#count number of rows by team and rename 'count' column to 'row_count'
df.groupBy('team').count().withColumnRenamed('count', 'row_count').show()

+----+---------+
|team|row_count|
+----+---------+
|   A|        4|
|   B|        4|
|   C|        2|
+----+---------+

The DataFrame now uses row_count as the column name, just as we specified.
