How can I calculate the minimum value of a column in PySpark?

To calculate the minimum value of a column in PySpark, use the agg function together with min from pyspark.sql.functions, which performs an aggregate operation on the specified column and returns its minimum value. Alternatively, you can pass one or more min expressions to the select method and call show to display the results. Both approaches are demonstrated with worked examples below.
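As a quick illustration, here is a minimal sketch of the first approach; it assumes you already have a SparkSession and a DataFrame named df with a numeric column called game1 (the same names used in the examples below). The aggregated result comes back as an ordinary Python value that you can store in a variable:

from pyspark.sql import functions as F

#aggregate the column and pull the single value out of the one-row result
min_game1 = df.agg(F.min('game1')).collect()[0][0]
print(min_game1)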

Calculate the Minimum Value of a Column in PySpark

You can use the following methods to calculate the minimum value of a column in a PySpark DataFrame:

Method 1: Calculate Minimum for One Specific Column

from pyspark.sql import functions as F

#calculate minimum of column named 'game1'
df.agg(F.min('game1')).collect()[0][0]
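An equivalent way to pull the scalar out of the one-row aggregate is first(), which returns the first Row of the result; a small sketch:

from pyspark.sql import functions as F

#same aggregation, using first() instead of collect() to grab the single row
df.agg(F.min('game1')).first()[0]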

Method 2: Calculate Minimum for Multiple Columns

from pyspark.sql.functions import min

#calculate minimum for game1, game2 and game3 columns
df.select(min(df.game1), min(df.game2), min(df.game3)).show()
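Note that importing min directly from pyspark.sql.functions shadows Python's built-in min in that namespace. If you want to keep the built-in available, one option is to import the PySpark function under an alias; a minimal sketch:

from pyspark.sql.functions import min as spark_min

#calculate minimum for game1, game2 and game3 columns without shadowing built-in min
df.select(spark_min(df.game1), spark_min(df.game2), spark_min(df.game3)).show()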

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 25, 11, 10], 
        ['Nets', 22, 8, 14], 
        ['Hawks', 14, 22, 10], 
        ['Kings', 30, 22, 35], 
        ['Bulls', 15, 14, 12], 
        ['Blazers', 10, 14, 18]] 
  
#define column names
columns = ['team', 'game1', 'game2', 'game3'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+-----+-----+-----+
|   team|game1|game2|game3|
+-------+-----+-----+-----+
|   Mavs|   25|   11|   10|
|   Nets|   22|    8|   14|
|  Hawks|   14|   22|   10|
|  Kings|   30|   22|   35|
|  Bulls|   15|   14|   12|
|Blazers|   10|   14|   18|
+-------+-----+-----+-----+

Example 1: Calculate Minimum for One Specific Column

We can use the following syntax to calculate the minimum of values in the game1 column of the DataFrame only:

from pyspark.sql import functions as F

#calculate minimum of column named 'game1'
df.agg(F.min('game1')).collect()[0][0]

10

The minimum of values in the game1 column turns out to be 10.

We can verify this is correct by manually identifying the minimum of the values in this column:

All values in the game1 column (sorted): 10, 14, 15, 22, 25, 30

We can see that 10 is indeed the minimum value in the column.
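If you prefer to do this check in code, one option (fine for a small DataFrame like this one, but not advisable for very large columns) is to collect the game1 values back to the driver and inspect them in plain Python; a minimal sketch:

#collect the game1 values to the driver and inspect them
game1_values = [row['game1'] for row in df.select('game1').collect()]
print(sorted(game1_values))  #[10, 14, 15, 22, 25, 30] -> the smallest value is 10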

Example 2: Calculate Minimum for Multiple Columns

We can use the following syntax to calculate the minimum of values for the game1, game2 and game3 columns of the DataFrame:

from pyspark.sql.functions import min

#calculate minimum for game1, game2 and game3 columns
df.select(min(df.game1), min(df.game2), min(df.game3)).show()

+----------+----------+----------+
|min(game1)|min(game2)|min(game3)|
+----------+----------+----------+
|        10|         8|        10|
+----------+----------+----------+
  • The minimum of values in the game1 column is 10.
  • The minimum of values in the game2 column is 8.
  • The minimum of values in the game3 column is 10.
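
If the DataFrame has many numeric columns, listing each one by hand gets tedious. One common pattern, sketched here under the assumption that every column you pass is numeric, is to build the min expressions with a list comprehension:

from pyspark.sql import functions as F

#build a min expression for each game column and evaluate them in one pass
game_cols = ['game1', 'game2', 'game3']
df.select([F.min(c).alias('min_' + c) for c in game_cols]).show()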
