How can I reshape a PySpark DataFrame from wide to long format?

Reshaping a PySpark DataFrame from wide to long format reorganizes the data so that each row holds a single observation (one combination of identifier, variable name, and value) instead of spreading the same kind of measurement across several columns. You can do this with the DataFrame melt function (available in PySpark 3.4.0 and later), which takes the columns to keep as identifiers and the columns to melt into a pair of variable and value columns. Long format is often easier to filter, group, and plot. It does not make the DataFrame smaller, though: each input row becomes one output row per melted column, so the long DataFrame has more rows and repeats the identifier values.

PySpark: Reshape DataFrame from Wide to Long


You can use the melt function with the following basic syntax to convert a PySpark DataFrame from a wide format to a long format:

df_long = df.melt(ids=['team'], values=['Guard', 'Forward', 'Center'], 
                  variableColumnName='position', 
                  valueColumnName='points')

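This particular example converts a wide DataFrame named df to a long DataFrame named df_long, keeping team as the identifier column and melting the Guard, Forward, and Center columns into two new columns named position and points.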

The following example shows how to use this syntax in practice.

Example: Reshape PySpark DataFrame from Wide to Long

Suppose we have the following PySpark DataFrame in a wide format:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 22, 34, 17], 
        ['B', 25, 10, 12]]

#define column names
columns = ['team', 'Guard', 'Forward', 'Center']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+-----+-------+------+
|team|Guard|Forward|Center|
+----+-----+-------+------+
|   A|   22|     34|    17|
|   B|   25|     10|    12|
+----+-----+-------+------+

We can use the following syntax to reshape this DataFrame from a wide format to a long format:

#create long DataFrame
df_long = df.melt(ids=['team'], values=['Guard', 'Forward', 'Center'], 
                  variableColumnName='position', 
                  valueColumnName='points')

#view long DataFrame
df_long.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    22|
|   A| Forward|    34|
|   A|  Center|    17|
|   B|   Guard|    25|
|   B| Forward|    10|
|   B|  Center|    12|
+----+--------+------+

The DataFrame is now in a long format.

Each team now appears on three rows (one per position), the former column names are used as values in the position column, and the corresponding points values are shown in the points column.

Note that we used the arguments variableColumnName and valueColumnName to specify the names to use for the second and third columns.

Note: You can find the complete documentation for the PySpark melt function in the official PySpark API reference, under pyspark.sql.DataFrame.melt.
