How can I reshape a DataFrame from long to wide in PySpark?

Reshaping a DataFrame from long to wide means reorganizing the data so that values spread across multiple rows are collected into a single row, with one column per unique value of some key. In PySpark this is done with the pivot function, typically combined with groupBy and an aggregation such as sum or agg. The wide format is useful for analysis and visualization because it makes it easier to compare values across variables.

PySpark: Reshape DataFrame from Long to Wide


You can use the following syntax to convert a PySpark DataFrame from a long format to a wide format:

df_wide = df.groupBy('team').pivot('player').sum('points')

In this scenario, the values from the team column will be shown along the rows, the values from the player column will be used as the column names, and the sum of values from the points column will be used as the values inside the DataFrame.

The following example shows how to use this syntax in practice.

Example: Reshape PySpark DataFrame from Long to Wide

Suppose we have the following PySpark DataFrame in a long format:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 1, 18], 
        ['A', 2, 33], 
        ['A', 3, 12], 
        ['A', 4, 15], 
        ['B', 1, 19],
        ['B', 2, 24],
        ['B', 3, 28],
        ['B', 4, 16]]

#define column names
columns = ['team', 'player', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+------+------+
|team|player|points|
+----+------+------+
|   A|     1|    18|
|   A|     2|    33|
|   A|     3|    12|
|   A|     4|    15|
|   B|     1|    19|
|   B|     2|    24|
|   B|     3|    28|
|   B|     4|    16|
+----+------+------+

We can use the following syntax to reshape this DataFrame from a long format to a wide format:

#create wide DataFrame
df_wide = df.groupBy('team').pivot('player').sum('points')

#view wide DataFrame
df_wide.show()

+----+---+---+---+---+
|team|  1|  2|  3|  4|
+----+---+---+---+---+
|   B| 19| 24| 28| 16|
|   A| 18| 33| 12| 15|
+----+---+---+---+---+

The DataFrame is now in a wide format.

The team is now shown along the rows, the player numbers are used as columns, and the sums of the points values are shown inside the DataFrame.

Note that we could instead use player as the rows and team as the columns if we’d like:

#create wide DataFrame
df_wide = df.groupBy('player').pivot('team').sum('points')

#view wide DataFrame
df_wide.show()

+------+---+---+
|player|  A|  B|
+------+---+---+
|     1| 18| 19|
|     3| 12| 28|
|     2| 33| 24|
|     4| 15| 16|
+------+---+---+

This DataFrame is also in a wide format.

Note: You can find the complete documentation for the PySpark pivot function in the official PySpark API reference.
