How to Reshape DataFrame from Long to Wide in PySpark?

In PySpark, you can reshape a DataFrame from long to wide format by chaining the DataFrame.groupBy() and DataFrame.pivot() methods. The groupBy() method groups the data by one or more columns, while pivot() turns the distinct values of another column into new columns. An aggregation function such as sum() then fills in the values of the resulting wide DataFrame.


You can use the following syntax to convert a PySpark DataFrame from a long format to a wide format:

df_wide = df.groupBy('team').pivot('player').sum('points')

In this syntax, the values from the team column become the rows, the unique values from the player column become the column names, and the sum of the values from the points column fills the cells of the resulting DataFrame.

The following example shows how to use this syntax in practice.

Example: Reshape PySpark DataFrame from Long to Wide

Suppose we have the following PySpark DataFrame in a long format:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 1, 18], 
        ['A', 2, 33], 
        ['A', 3, 12], 
        ['A', 4, 15], 
        ['B', 1, 19],
        ['B', 2, 24],
        ['B', 3, 28],
        ['B', 4, 16]]

#define column names
columns = ['team', 'player', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+------+------+
|team|player|points|
+----+------+------+
|   A|     1|    18|
|   A|     2|    33|
|   A|     3|    12|
|   A|     4|    15|
|   B|     1|    19|
|   B|     2|    24|
|   B|     3|    28|
|   B|     4|    16|
+----+------+------+

We can use the following syntax to reshape this DataFrame from a long format to a wide format:

#create wide DataFrame
df_wide = df.groupBy('team').pivot('player').sum('points')

#view wide DataFrame
df_wide.show()

+----+---+---+---+---+
|team|  1|  2|  3|  4|
+----+---+---+---+---+
|   B| 19| 24| 28| 16|
|   A| 18| 33| 12| 15|
+----+---+---+---+---+

The DataFrame is now in a wide format.

The team values are now shown along the rows, the player numbers are used as columns, and the sum of the points values is shown inside the DataFrame.

Note that we could instead use player as the rows and team as the columns if we’d like:

#create wide DataFrame
df_wide = df.groupBy('player').pivot('team').sum('points')

#view wide DataFrame
df_wide.show()

+------+---+---+
|player|  A|  B|
+------+---+---+
|     1| 18| 19|
|     3| 12| 28|
|     2| 33| 24|
|     4| 15| 16|
+------+---+---+

This DataFrame is also in a wide format.

Note: You can find the complete documentation for the PySpark pivot function in the PySpark API reference.
