How can I use PySpark to calculate the maximum value across multiple columns in a dataframe?

PySpark is a powerful tool for performing data analysis and manipulation on large datasets. One common task is calculating the maximum value across multiple columns of a DataFrame, that is, finding the largest value in each row among a given set of columns. PySpark can perform this task efficiently even on very large datasets, making it a valuable tool for data analysis and decision-making.

PySpark: Calculate Max Value Across Columns


You can use the following syntax to calculate the max value across multiple columns in a PySpark DataFrame:

from pyspark.sql.functions import greatest

#find max value across columns 'game1', 'game2', and 'game3'
df_new = df.withColumn('max', greatest('game1', 'game2', 'game3'))

This particular example creates a new column called max that contains the maximum of the values across the game1, game2, and game3 columns in the DataFrame. Note that greatest requires at least two columns and skips null values; it returns null only when every value in the row is null.
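
If the column names are not fixed in advance, the same call can be built from a Python list of names using argument unpacking. This is a minimal sketch, where point_cols is a hypothetical list of the columns to compare:

from pyspark.sql.functions import greatest

#hypothetical list of the columns to compare
point_cols = ['game1', 'game2', 'game3']

#unpack the list into greatest with the * operator
df_new = df.withColumn('max', greatest(*point_cols))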

The following example shows how to use this syntax in practice.

Example: How to Calculate Max Value Across Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players during three different games:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 25, 11, 10], 
        ['Nets', 22, 8, 14], 
        ['Hawks', 14, 22, 10], 
        ['Kings', 30, 22, 35], 
        ['Bulls', 15, 14, 12], 
        ['Blazers', 10, 14, 18]] 
  
#define column names
columns = ['team', 'game1', 'game2', 'game3'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+-----+-----+-----+
|   team|game1|game2|game3|
+-------+-----+-----+-----+
|   Mavs|   25|   11|   10|
|   Nets|   22|    8|   14|
|  Hawks|   14|   22|   10|
|  Kings|   30|   22|   35|
|  Bulls|   15|   14|   12|
|Blazers|   10|   14|   18|
+-------+-----+-----+-----+

Suppose we would like to add a new column called max that contains the maximum number of points scored by each player across all three games.

We can use the following syntax to do so:

from pyspark.sql.functions import greatest

#find max value across columns 'game1', 'game2', and 'game3'
df_new = df.withColumn('max', greatest('game1', 'game2', 'game3'))

#view new DataFrame
df_new.show()

+-------+-----+-----+-----+---+
|   team|game1|game2|game3|max|
+-------+-----+-----+-----+---+
|   Mavs|   25|   11|   10| 25|
|   Nets|   22|    8|   14| 22|
|  Hawks|   14|   22|   10| 22|
|  Kings|   30|   22|   35| 35|
|  Bulls|   15|   14|   12| 15|
|Blazers|   10|   14|   18| 18|
+-------+-----+-----+-----+---+

Notice that the new max column contains the maximum of the values across the game1, game2, and game3 columns.

For example:

  • The max of points for the Mavs player is 25
  • The max of points for the Nets player is 22
  • The max of points for the Hawks player is 22

And so on.
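
As an aside, the same result can also be produced without greatest by packing the columns into an array and taking the maximum of that array. This is just an alternative sketch; array_max requires Spark 2.4 or later:

from pyspark.sql.functions import array, array_max

#pack the game columns into an array, then take the max of each array
df_alt = df.withColumn('max', array_max(array('game1', 'game2', 'game3')))

#view new DataFrame
df_alt.show()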

Note that we used the withColumn function to return a new DataFrame with the max column added and all other columns left unchanged.
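
If you prefer, select offers an equivalent way to append the column. This is a minimal sketch using the same greatest call:

from pyspark.sql.functions import greatest

#keep every existing column and append the new max column
df_new = df.select('*', greatest('game1', 'game2', 'game3').alias('max'))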

You can find the complete documentation for the withColumn function in the official PySpark documentation.
