PySpark is a powerful open-source library for big data processing in Python. One common task in data analysis is finding the maximum value of a particular column in a dataset. In PySpark, you can do this by passing the `max()` function from `pyspark.sql.functions` to the `agg()` method on a DataFrame; the result can then be stored in a variable or used for further analysis.
Calculate the Max Value of a Column in PySpark
You can use the following methods to calculate the max value of a column in a PySpark DataFrame:
Method 1: Calculate Max for One Specific Column
from pyspark.sql import functions as F
#calculate max of column named 'game1'
df.agg(F.max('game1')).collect()[0][0]
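Note: `df.agg(F.max('game1'))` returns a DataFrame with a single row, and `.collect()[0][0]` pulls the scalar value out of it. As a minimal sketch of an equivalent approach (assuming the same df, with `max_game1` as an illustrative variable name), you can use `.first()` to fetch that row directly:
from pyspark.sql import functions as F
#agg() returns a one-row DataFrame; first() fetches that row
max_game1 = df.agg(F.max('game1')).first()[0]
#the scalar can now be stored or used in further analysis
print(max_game1)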
Method 2: Calculate Max for Multiple Columns
from pyspark.sql.functions import max
#calculate max for game1, game2 and game3 columns
df.select(max(df.game1), max(df.game2), max(df.game3)).show()
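Note that importing `max` directly from `pyspark.sql.functions` shadows Python's built-in `max()` function in the current namespace. A sketch of the same query that avoids this, using the aliased import from Method 1:
from pyspark.sql import functions as F
#calculate max for game1, game2 and game3 without shadowing the built-in max
df.select(F.max(df.game1), F.max(df.game2), F.max(df.game3)).show()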
The following examples show how to use each method in practice with this PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['Mavs', 25, 11, 10],
['Nets', 22, 8, 14],
['Hawks', 14, 22, 10],
['Kings', 30, 22, 35],
['Bulls', 15, 14, 12],
['Blazers', 10, 14, 18]]
#define column names
columns = ['team', 'game1', 'game2', 'game3']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+-------+-----+-----+-----+
| team|game1|game2|game3|
+-------+-----+-----+-----+
| Mavs| 25| 11| 10|
| Nets| 22| 8| 14|
| Hawks| 14| 22| 10|
| Kings| 30| 22| 35|
| Bulls| 15| 14| 12|
|Blazers| 10| 14| 18|
+-------+-----+-----+-----+
Example 1: Calculate Max for One Specific Column
We can use the following syntax to calculate the max of the values in only the game1 column of the DataFrame:
from pyspark.sql import functions as F
#calculate max of column named 'game1'
df.agg(F.max('game1')).collect()[0][0]
30
The max of values in the game1 column turns out to be 30.
We can verify this is correct by manually identifying the max of the values in this column:
All values in the game1 column: 10, 14, 15, 22, 25, 30
We can see that 30 is indeed the max value in the column.
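As a quick programmatic sanity check (a minimal sketch assuming the df defined above, with `game1_values` as an illustrative name), you can collect the column into a plain Python list and apply the built-in `max()`:
#collect the game1 values into a plain Python list
game1_values = [row['game1'] for row in df.select('game1').collect()]
#compare the PySpark result with Python's built-in max
print(max(game1_values))  #prints 30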
Example 2: Calculate Max for Multiple Columns
We can use the following syntax to calculate the max of values for the game1, game2 and game3 columns of the DataFrame:
from pyspark.sql.functions import max
#calculate max for game1, game2 and game3 columns
df.select(max(df.game1), max(df.game2), max(df.game3)).show()
+----------+----------+----------+
|max(game1)|max(game2)|max(game3)|
+----------+----------+----------+
| 30| 22| 35|
+----------+----------+----------+
- The max of values in the game1 column is 30.
- The max of values in the game2 column is 22.
- The max of values in the game3 column is 35.
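If you have many numeric columns, one way to avoid listing each aggregation by hand (a sketch assuming the game columns above, with `game_cols` as an illustrative name) is to build the expressions in a list comprehension:
from pyspark.sql import functions as F
#calculate max for every game column in one pass
game_cols = ['game1', 'game2', 'game3']
df.select([F.max(c).alias('max_' + c) for c in game_cols]).show()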