How to reshape DataFrame from wide to long in PySpark

In PySpark, a DataFrame can be reshaped from wide to long with the melt function (an alias of unpivot, available in Spark 3.4.0 and later), which turns a set of columns into rows. On older versions, a similar result can be obtained with selectExpr and the SQL stack function. The pivot function, used together with groupBy and agg, does the opposite: it reshapes data from long to wide. The reshaped DataFrame has the key column, a variable column, and a value column.


You can use the melt function with the following basic syntax to convert a PySpark DataFrame from a wide format to a long format:

df_long = df.melt(ids=['team'], values=['Guard', 'Forward', 'Center'], 
                  variableColumnName='position', 
                  valueColumnName='points')

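This particular example converts a wide DataFrame named df to a long DataFrame named df_long. The ids argument lists the columns to keep as identifiers, and the values argument lists the columns to turn into rows.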

The following example shows how to use this syntax in practice.

Example: Reshape PySpark DataFrame from Wide to Long

Suppose we have the following PySpark DataFrame in a wide format:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 22, 34, 17],
        ['B', 25, 10, 12]]

#define column names
columns = ['team', 'Guard', 'Forward', 'Center']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+-----+-------+------+
|team|Guard|Forward|Center|
+----+-----+-------+------+
|   A|   22|     34|    17|
|   B|   25|     10|    12|
+----+-----+-------+------+

We can use the following syntax to reshape this DataFrame from a wide format to a long format:

#create long DataFrame
df_long = df.melt(ids=['team'], values=['Guard', 'Forward', 'Center'], 
                  variableColumnName='position', 
                  valueColumnName='points')

#view long DataFrame
df_long.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    22|
|   A| Forward|    34|
|   A|  Center|    17|
|   B|   Guard|    25|
|   B| Forward|    10|
|   B|  Center|    12|
+----+--------+------+

The DataFrame is now in a long format.

Each team now appears on one row per position: the former column names Guard, Forward, and Center are the values in the second column, and the corresponding points values are shown in the third column.
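To make the reshaping rule concrete, here is a minimal plain-Python sketch of what happens to each row. It is an illustration only, not how PySpark implements melt; the helper name melt_rows and the use of dictionaries for rows are assumptions made for this example.

```python
def melt_rows(rows, ids, values, var_name, val_name):
    """Plain-Python illustration of wide-to-long reshaping (not PySpark code)."""
    long_rows = []
    for row in rows:
        # Each wide row produces one long row per column listed in `values`.
        for col in values:
            new_row = {key: row[key] for key in ids}  # keep identifier columns
            new_row[var_name] = col                   # former column name
            new_row[val_name] = row[col]              # former cell value
            long_rows.append(new_row)
    return long_rows


# Same data as the wide DataFrame in this example
wide = [
    {"team": "A", "Guard": 22, "Forward": 34, "Center": 17},
    {"team": "B", "Guard": 25, "Forward": 10, "Center": 12},
]

long_rows = melt_rows(wide, ["team"], ["Guard", "Forward", "Center"],
                      "position", "points")

for r in long_rows:
    print(r)
```

Two wide rows with three value columns give 2 x 3 = 6 long rows, which matches the number of rows in df_long.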

Note that we used the arguments variableColumnName and valueColumnName to specify the names to use for the second and third columns.
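If melt is not available in your Spark version, one possible workaround is to combine selectExpr with the SQL stack function. The sketch below is hedged: the helper names build_stack_expr and melt_with_stack are hypothetical, the code assumes simple column names without quotes or backticks, and, as with melt, the value columns are assumed to share a compatible data type.

```python
def build_stack_expr(values, var_name, val_name):
    """Build a Spark SQL stack(...) expression that turns columns into rows.

    Hypothetical helper for illustration; assumes plain column names.
    """
    # Each value column contributes a pair: its name as a string literal,
    # then a reference to the column itself.
    pairs = ", ".join(f"'{c}', `{c}`" for c in values)
    return f"stack({len(values)}, {pairs}) as ({var_name}, {val_name})"


def melt_with_stack(df, ids, values, var_name, val_name):
    """Reshape a PySpark DataFrame from wide to long using selectExpr + stack."""
    return df.selectExpr(*ids, build_stack_expr(values, var_name, val_name))


# Expression that would be passed to selectExpr for the example DataFrame
expr = build_stack_expr(['Guard', 'Forward', 'Center'], 'position', 'points')
print(expr)
```

With the DataFrame from this example, melt_with_stack(df, ['team'], ['Guard', 'Forward', 'Center'], 'position', 'points') should give the same three-column layout as df_long.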

Note: You can find the complete documentation for the PySpark melt function in the official PySpark API reference.
