How can I create a Pivot Table in PySpark? Can you provide an example?


You can use the following syntax to create a pivot table from a  PySpark DataFrame:

df.groupBy('team').pivot('position').sum('points').show()

This particular example creates a pivot table using the team column as the rows, the position column as the columns in the pivot table and the sum of the points column as the values within the pivot table.

The following example shows how to use this syntax in practice.

Example: How to Create Pivot Table in PySpark

Suppose we have the following PySpark DataFrame that contains information about the points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'G', 4], 
        ['A', 'G', 4], 
        ['A', 'F', 6], 
        ['A', 'F', 8], 
        ['B', 'G', 9], 
        ['B', 'F', 5],
        ['B', 'F', 5],
        ['B', 'F', 12]]
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|       G|     4|
|   A|       G|     4|
|   A|       F|     6|
|   A|       F|     8|
|   B|       G|     9|
|   B|       F|     5|
|   B|       F|     5|
|   B|       F|    12|
+----+--------+------+

We can use the following syntax to create a pivot table using team as the rows, position as the columns and the sum of points as the values within the pivot table:

#create pivot table that shows sum of points by team and position
df.groupBy('team').pivot('position').sum('points').show()

+----+---+---+
|team|  F|  G|
+----+---+---+
|   B| 22|  9|
|   A| 14|  8|
+----+---+---+

The resulting pivot table shows the sum of the points values for each team and position.

For example, we can see:

  • Players on team B in position F scored a total of 22 points.
  • Players on team B in position G scored a total of 9 points.
  • Players on team A in position F scored a total of 14 points.
  • Players on team A in position G scored a total of 8 points.

Note that we could also use a different metric to summarize the points values if we’d like.

For example, we could use mean instead of sum:

#create pivot table that shows mean of points by team and position
df.groupBy('team').pivot('position').mean('points').show()

+----+-----------------+---+
|team|                F|  G|
+----+-----------------+---+
|   B|7.333333333333333|9.0|
|   A|              7.0|4.0|
+----+-----------------+---+

The resulting pivot table shows the mean of the points values for each team and position.

For example, we can see:

  • Players on team B in position F scored a mean of 7.33 points.
  • Players on team B in position G scored a mean of 9 points.
  • Players on team A in position F scored a mean of 7 points.
  • Players on team A in position G scored a mean of 4 points.

Feel free to use whichever summary metric you would like when creating your own pivot table.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

x