How can I sort a pivot table in PySpark by the values in a specific column?

Pivot tables in PySpark allow you to group and aggregate data across two categorical columns. Once a pivot table has been created, it is often useful to sort its rows by the values in a particular column. Because the result of a pivot is just another DataFrame, you can do this with the orderBy function: apply it directly to the pivot table, pass it the name of the column to sort by, and the rows will be returned in ascending (the default) or descending order based on the values in that column.

PySpark: Sort Pivot Table by Values in Column


You can use the following syntax to sort the rows in a pivot table in PySpark based on values in a specific column:

df_pivot.orderBy('my_column').show()

This particular example sorts the rows in the pivot table called df_pivot based on the values in the column named my_column.
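Note that my_column is simply a placeholder for the name of the column you want to sort by. You can also pass a column expression instead of a column name, which is handy when you want more control over the sort, for example descending order:

from pyspark.sql.functions import col

#sort pivot table by values in 'my_column' in descending order
df_pivot.orderBy(col('my_column').desc()).show()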

The following example shows how to use this syntax in practice.

Example: How to Sort a Pivot Table in PySpark

Suppose we have the following PySpark DataFrame that contains information about the points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 14],
        ['A', 'Guard', 4],
        ['A', 'Forward', 16],
        ['A', 'Forward', 18],
        ['B', 'Guard', 9],
        ['B', 'Forward', 5],
        ['B', 'Forward', 25],
        ['C', 'Forward', 12],
        ['C', 'Guard', 14],
        ['C', 'Guard', 23]]

#define column names
columns = ['team', 'position', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    14|
|   A|   Guard|     4|
|   A| Forward|    16|
|   A| Forward|    18|
|   B|   Guard|     9|
|   B| Forward|     5|
|   B| Forward|    25|
|   C| Forward|    12|
|   C|   Guard|    14|
|   C|   Guard|    23|
+----+--------+------+

We can use the following syntax to create a pivot table that uses team as the rows, position as the columns, and the sum of points as the values within the pivot table:

#create pivot table that shows sum of points by team and position
df_pivot = df.groupBy('team').pivot('position').sum('points')

#view pivot table
df_pivot.show()

+----+-------+-----+
|team|Forward|Guard|
+----+-------+-----+
|   B|     30|    9|
|   C|     12|   37|
|   A|     34|   18|
+----+-------+-----+

The resulting pivot table shows the sum of the points values for each team and position.
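As an aside, if you already know the distinct values in the pivot column, you can pass them explicitly as the second argument to pivot. This is optional, but it saves Spark a pass over the data to collect the distinct values:

#create the same pivot table with the pivot values specified up front
df_pivot = df.groupBy('team').pivot('position', ['Forward', 'Guard']).sum('points')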

We can use the following syntax to sort the rows of the pivot table in ascending order based on the values in the Forward column:

#sort rows of pivot table by values in 'Forward' column in ascending order
df_pivot.orderBy('Forward').show()

+----+-------+-----+
|team|Forward|Guard|
+----+-------+-----+
|   C|     12|   37|
|   B|     30|    9|
|   A|     34|   18|
+----+-------+-----+

Notice that the rows in the pivot table are now sorted in ascending order based on the values in the Forward column.
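One caveat to keep in mind: in this example every team has at least one row for each position, but if a team had no rows for a given position, the corresponding cell in the pivot table would be null, and Spark places nulls first when sorting in ascending order. If you would rather have null values appear last, you can sort with a column expression and asc_nulls_last:

from pyspark.sql.functions import col

#sort by 'Forward' in ascending order, placing any null values last
df_pivot.orderBy(col('Forward').asc_nulls_last()).show()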

If you would instead like to sort the rows in descending order, you can use the argument ascending=False as follows:

#sort rows of pivot table by values in 'Forward' column in descending order
df_pivot.orderBy('Forward', ascending=False).show()

+----+-------+-----+
|team|Forward|Guard|
+----+-------+-----+
|   A|     34|   18|
|   B|     30|    9|
|   C|     12|   37|
+----+-------+-----+

The rows in the pivot table are now sorted in descending order based on the values in the Forward column.
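You can also sort by more than one column at a time by passing a list of column names along with a matching list of booleans for the ascending argument. For example, the following sorts by the Forward column in descending order and breaks any ties using the Guard column in ascending order:

#sort by 'Forward' descending, then by 'Guard' ascending to break ties
df_pivot.orderBy(['Forward', 'Guard'], ascending=[False, True]).show()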

Note: You can find the complete documentation for the PySpark orderBy function in the official PySpark API reference.
