How can I calculate the mode of a column in PySpark?

To calculate the mode of a column in PySpark, you can group the DataFrame by that column, count the occurrences of each value, and take the value with the highest count. This approach works for both numerical and categorical data. It is important to note that if more than one value occurs with the same highest frequency, this approach returns one of the tied values, and which one is not guaranteed. Also note that null values form their own group under groupby, so you may want to filter them out before calculating the mode. (Starting with Spark 3.4, there is also a built-in mode() aggregate function, which ignores nulls.) By counting value frequencies this way, you can easily obtain the most common value in a column and use it for further analysis or data manipulation.

Calculate the Mode of a Column in PySpark


You can use the following methods to calculate the mode of a column in a PySpark DataFrame:

Method 1: Calculate Mode for One Specific Column

#calculate mode of 'conference' column
df.groupby('conference').count().orderBy('count', ascending=False).first()[0]

Method 2: Calculate Mode for All Columns

#calculate mode of each column in the DataFrame
[[i,df.groupby(i).count().orderBy('count', ascending=False)
  .first()[0]] for i in df.columns]

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Calculate Mode for One Specific Column

We can use the following syntax to calculate the mode of just the conference column of the DataFrame:

#calculate mode of 'conference' column
df.groupby('conference').count().orderBy('count', ascending=False).first()[0]

'East'

The mode of the conference column is East.

This represents the most frequently occurring value in the conference column.

Example 2: Calculate Mode for All Columns

We can use the following syntax to calculate the mode in each column of the DataFrame:

#calculate mode of each column in the DataFrame
[[i,df.groupby(i).count().orderBy('count', ascending=False)
  .first()[0]] for i in df.columns]

[['team', 'A'], ['conference', 'East'], ['points', 6], ['assists', 4]]

The output shows the mode for each column in the DataFrame.

For example, we can see:

  • The mode of the team column is ‘A’
  • The mode of the conference column is ‘East’
  • The mode of the points column is 6
  • The mode of the assists column is 4

Note: In both examples, we used the groupby and count functions to count the occurrences of each unique value in the column, then we extracted the value with the highest count to get the mode. Keep in mind that if two values are tied for the highest frequency, this approach returns one of them, and which one is not guaranteed. Null values are also counted as their own group, so filter them out first if you don't want null returned as a mode.
