How can I unpivot a PySpark DataFrame? Can you provide an example?

Unpivoting a PySpark DataFrame refers to the process of transforming a DataFrame from wide format to long format, where multiple columns are turned into rows. In PySpark 3.4 and later, this can be achieved with the DataFrame.unpivot method (also available under the alias melt), which collapses multiple columns into two new columns: one containing the original column names and the other containing the corresponding values. An example of unpivoting a PySpark DataFrame is shown below:

Input DataFrame:

| id | name | subject_1 | subject_2 | subject_3 |
|----|------|-----------|-----------|-----------|
| 1  | John | 85        | 90        | 95        |
| 2  | Jane | 80        | 75        | 85        |

Unpivoted DataFrame:

| id | name | subject   | score |
|----|------|-----------|-------|
| 1  | John | subject_1 | 85    |
| 1  | John | subject_2 | 90    |
| 1  | John | subject_3 | 95    |
| 2  | Jane | subject_1 | 80    |
| 2  | Jane | subject_2 | 75    |
| 2  | Jane | subject_3 | 85    |

This transformation allows for easier analysis and visualization of the data.
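
In PySpark 3.4 and later, this can be reproduced with the melt method (an alias of unpivot); the sketch below recreates the input table above and assumes a running SparkSession:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#recreate the wide-format input DataFrame shown above
df = spark.createDataFrame([(1, 'John', 85, 90, 95),
                            (2, 'Jane', 80, 75, 85)],
                           ['id', 'name', 'subject_1', 'subject_2', 'subject_3'])

#collapse the three subject columns into (subject, score) rows
df_long = df.melt(ids=['id', 'name'],
                  values=['subject_1', 'subject_2', 'subject_3'],
                  variableColumnName='subject',
                  valueColumnName='score')

df_long.show()

On Spark versions before 3.4, the same reshaping can be done with the stack SQL expression, as shown later in this article.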

Unpivot a PySpark DataFrame (With Example)

You can use the unpivot function (available in PySpark 3.4 and later) to unpivot a PySpark DataFrame from wide format to long format.
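
In general, the call has the form df.unpivot(ids, values, variableColumnName, valueColumnName). As a minimal runnable sketch (the small sales DataFrame here is purely illustrative, not part of the example that follows):

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#illustrative wide-format DataFrame: one row per store, one column per quarter
sales = spark.createDataFrame([('Store1', 100, 120), ('Store2', 90, 110)],
                              ['store', 'q1', 'q2'])

#collapse the quarter columns into (quarter, amount) rows
sales.unpivot(ids=['store'],
              values=['q1', 'q2'],
              variableColumnName='quarter',
              valueColumnName='amount').show()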

The following example shows how to use this syntax in practice.

Example: How to Unpivot a PySpark DataFrame

Suppose we create the following PySpark DataFrame that contains information about the points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Center', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B|  Center|     7|
+----+--------+------+

We can use the following syntax to create a pivot table that uses team as the rows, position as the columns, and the sum of points as the values:

#create pivot table that shows sum of points by team and position
df_pivot = df.groupBy('team').pivot('position').sum('points')

#view pivoted DataFrame
df_pivot.show()

+----+------+-------+-----+
|team|Center|Forward|Guard|
+----+------+-------+-----+
|   B|     7|     13|   28|
|   A|  null|     44|   19|
+----+------+-------+-----+

The resulting pivot table shows the sum of the points values for each team and position.

In order to unpivot this DataFrame, we can use the unpivot function with the following syntax:

#unpivot DataFrame
df_unpivot = df_pivot.unpivot(ids=['team'],
                              values=['Center', 'Forward', 'Guard'],
                              variableColumnName='position',
                              valueColumnName='points')

#view unpivoted DataFrame
df_unpivot.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   B|  Center|     7|
|   B| Forward|    13|
|   B|   Guard|    28|
|   A|  Center|  null|
|   A| Forward|    44|
|   A|   Guard|    19|
+----+--------+------+

The DataFrame is now back in long format with three columns, though the points values are now the sums computed in the pivot step rather than the original individual rows.
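
Note: the unpivot method requires PySpark 3.4 or later. On older versions, a sketch using the stack SQL expression (assuming the same df_pivot DataFrame) produces the same result:

#pre-3.4 alternative: unpivot df_pivot with the stack() SQL expression
df_unpivot_alt = df_pivot.selectExpr(
    "team",
    "stack(3, 'Center', Center, 'Forward', Forward, 'Guard', Guard) as (position, points)"
)

df_unpivot_alt.show()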

Lastly, we can filter out any rows with a null value in the points column by using the filter function:

#filter out rows where points column is null
df_unpivot.filter(df_unpivot.points.isNotNull()).show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   B|  Center|     7|
|   B| Forward|    13|
|   B|   Guard|    28|
|   A| Forward|    44|
|   A|   Guard|    19|
+----+--------+------+

This final DataFrame has now been unpivoted and there are no rows with a null value in the points column.
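
Equivalently, assuming the same df_unpivot DataFrame, the na.drop method restricted to the points column gives the same result:

#equivalent: drop any row where the points column is null
df_unpivot.na.drop(subset=['points']).show()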

Note: You can find the complete documentation for the PySpark unpivot function in the official PySpark API reference.
