How can a pivot table be created in PySpark?

A pivot table is a data summarization tool that lets you quickly aggregate and analyze large datasets. In PySpark, a pivot table is built by chaining three calls on a DataFrame: groupBy() with the column whose values become the rows, pivot() with the column whose values become the columns, and an aggregation such as sum() applied to the column whose values fill the table. For example, if we have a dataset of sales data with columns for product, month, and revenue, we can create a pivot table that shows the total revenue for each product in each month by grouping on product, pivoting on month, and summing revenue. The result is a new DataFrame with the product names as the rows, the months as the columns, and the total revenue for each combination as the values. This pivot table can then be further analyzed and visualized to gain insights into the sales data.
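
For instance, here is a minimal sketch of the sales example just described (the data, column names, and values are purely illustrative, not taken from a real dataset):

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#hypothetical sales data with product, month, and revenue columns
sales = spark.createDataFrame(
    [['Widget', 'Jan', 100], ['Widget', 'Feb', 150],
     ['Gadget', 'Jan', 200], ['Gadget', 'Feb', 250]],
    ['product', 'month', 'revenue'])

#rows = product, columns = month, values = total revenue for each combination
sales.groupBy('product').pivot('month').sum('revenue').show()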

Create a Pivot Table in PySpark (With Example)


You can use the following syntax to create a pivot table from a PySpark DataFrame:

df.groupBy('team').pivot('position').sum('points').show()

This particular example creates a pivot table with the team column as the rows, the position column as the columns, and the sum of the points column as the values.

The following example shows how to use this syntax in practice.

Example: How to Create a Pivot Table in PySpark

Suppose we have the following PySpark DataFrame that contains information about the points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'G', 4], 
        ['A', 'G', 4], 
        ['A', 'F', 6], 
        ['A', 'F', 8], 
        ['B', 'G', 9], 
        ['B', 'F', 5],
        ['B', 'F', 5],
        ['B', 'F', 12]]
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|       G|     4|
|   A|       G|     4|
|   A|       F|     6|
|   A|       F|     8|
|   B|       G|     9|
|   B|       F|     5|
|   B|       F|     5|
|   B|       F|    12|
+----+--------+------+

We can use the following syntax to create a pivot table using team as the rows, position as the columns and the sum of points as the values within the pivot table:

#create pivot table that shows sum of points by team and position
df.groupBy('team').pivot('position').sum('points').show()

+----+---+---+
|team|  F|  G|
+----+---+---+
|   B| 22|  9|
|   A| 14|  8|
+----+---+---+

The resulting pivot table shows the sum of the points values for each team and position.

For example, we can see:

  • Players on team B in position F scored a total of 22 points.
  • Players on team B in position G scored a total of 9 points.
  • Players on team A in position F scored a total of 14 points.
  • Players on team A in position G scored a total of 8 points.
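
In this small dataset every team has players at both positions, so every cell in the pivot table is filled. On larger data, some team and position combinations may have no rows and will show up as null. As a minimal sketch, assuming the same df defined above, you could list the pivot values explicitly (which also saves Spark a pass over the data to discover them) and replace any empty cells with 0:

#name the pivot column values up front and fill missing combinations with 0
df.groupBy('team').pivot('position', ['F', 'G']).sum('points').fillna(0).show()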

Note that we could also use a different metric to summarize the points values if we’d like.

For example, we could use mean instead of sum:

#create pivot table that shows mean of points by team and position
df.groupBy('team').pivot('position').mean('points').show()

+----+-----------------+---+
|team|                F|  G|
+----+-----------------+---+
|   B|7.333333333333333|9.0|
|   A|              7.0|4.0|
+----+-----------------+---+

The resulting pivot table shows the mean of the points values for each team and position.

For example, we can see:

  • Players on team B in position F scored a mean of 7.33 points.
  • Players on team B in position G scored a mean of 9 points.
  • Players on team A in position F scored a mean of 7 points.
  • Players on team A in position G scored a mean of 4 points.

Feel free to use whichever summary metric you would like when creating your own pivot table.
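
You can also compute several summary metrics at once by ending the chain with agg() instead of a single aggregation method. A minimal sketch, again assuming the df from this example (Spark generates combined column names such as F_sum(points) in the output):

from pyspark.sql import functions as F

#create pivot table that shows both the sum and the mean of points
#by team and position in a single pass
df.groupBy('team').pivot('position').agg(F.sum('points'), F.mean('points')).show()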
