How can I perform data binning in PySpark?

Data binning is the process of transforming continuous data into discrete intervals, or bins. In PySpark, this can be accomplished with the Bucketizer class, which takes a list of split points and assigns each data point to a bucket based on those splits. Alternatively, the QuantileDiscretizer class can be used to create buckets that each contain roughly equal numbers of observations (a short sketch appears at the end of this answer). Both classes are available in the pyspark.ml.feature module.


You can use the following syntax to perform data binning in a PySpark DataFrame:

from pyspark.ml.feature import Bucketizer

#specify bin ranges and column to bin
bucketizer = Bucketizer(splits=[0, 5, 10, 15, 20, float('Inf')],
                        inputCol='points',
                        outputCol='bins')

#perform binning based on values in 'points' column
df_bins = bucketizer.setHandleInvalid('keep').transform(df)

This particular example adds a new column named bins to the DataFrame that takes on the following values:

  • 0 if the value in the points column is in the range [0,5)
  • 1 if the value in the points column is in the range [5,10)
  • 2 if the value in the points column is in the range [10,15)
  • 3 if the value in the points column is in the range [15,20)
  • 4 if the value in the points column is in the range [20,Infinity)

The following example shows how to use this function in practice.

Example: How to Perform Data Binning in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by basketball players on various teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 3],
        ['B', 8], 
        ['C', 9], 
        ['D', 9], 
        ['E', 12], 
        ['F', None],
        ['G', 15],
        ['H', 17],
        ['I', 19],
        ['J', 22]] 
  
#define column names
columns = ['player', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+------+------+
|player|points|
+------+------+
|     A|     3|
|     B|     8|
|     C|     9|
|     D|     9|
|     E|    12|
|     F|  null|
|     G|    15|
|     H|    17|
|     I|    19|
|     J|    22|
+------+------+

We can use the following syntax to bin each of the values in the points column based on specific bin ranges:

from pyspark.ml.feature import Bucketizer

#specify bin ranges and column to bin
bucketizer = Bucketizer(splits=[0, 5, 10, 15, 20, float('Inf')],
                        inputCol='points',
                        outputCol='bins')

#perform binning based on values in 'points' column
df_bins = bucketizer.setHandleInvalid('keep').transform(df)

#view new DataFrame
df_bins.show()

+------+------+----+
|player|points|bins|
+------+------+----+
|     A|     3| 0.0|
|     B|     8| 1.0|
|     C|     9| 1.0|
|     D|     9| 1.0|
|     E|    12| 2.0|
|     F|  null|null|
|     G|    15| 3.0|
|     H|    17| 3.0|
|     I|    19| 3.0|
|     J|    22| 4.0|
+------+------+----+

The bins column now displays a value of 0, 1, 2, 3, 4, or null (stored as a double) based on the corresponding value in the points column.
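If you also want to see how many players fall into each bin, one option (a small follow-up sketch, not part of the original example) is to group by the new bins column:

#count the number of players that fall into each bin
df_bins.groupBy('bins').count().orderBy('bins').show()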

Note that the argument setHandleInvalid('keep') specifies that any invalid values, such as nulls, should be kept rather than dropped. As shown above, the row with a null points value remains in the DataFrame, with a null value in the bins column.

We could also specify setHandleInvalid('skip') to simply remove invalid values from the DataFrame:

from pyspark.ml.feature import Bucketizer

#specify bin ranges and column to bin
bucketizer = Bucketizer(splits=[0, 5, 10, 15, 20, float('Inf')],
                        inputCol='points',
                        outputCol='bins')

#perform binning based on values in 'points' column, remove invalid values
df_bins = bucketizer.setHandleInvalid('skip').transform(df)

#view new DataFrame
df_bins.show()

+------+------+----+
|player|points|bins|
+------+------+----+
|     A|     3| 0.0|
|     B|     8| 1.0|
|     C|     9| 1.0|
|     D|     9| 1.0|
|     E|    12| 2.0|
|     G|    15| 3.0|
|     H|    17| 3.0|
|     I|    19| 3.0|
|     J|    22| 4.0|
+------+------+----+

Notice that the row that contained null in the points column has simply been removed.
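The introduction also mentioned the QuantileDiscretizer class, which is useful when you want each bin to contain roughly the same number of observations instead of fixed ranges. Here is a minimal sketch that reuses the df from the example above; the choice of four buckets is arbitrary and only for illustration:

from pyspark.ml.feature import QuantileDiscretizer

#split the 'points' column into 4 buckets with roughly equal counts
#(4 buckets is an arbitrary choice for illustration)
discretizer = QuantileDiscretizer(numBuckets=4,
                                  inputCol='points',
                                  outputCol='bins')

#fit computes the quantile-based split points, transform assigns the bins
df_quartiles = discretizer.setHandleInvalid('keep').fit(df).transform(df)

df_quartiles.show()

Because QuantileDiscretizer is an estimator, it must first be fit on the data to compute the split points before it can transform the DataFrame.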

Note: You can find the complete documentation for the PySpark Bucketizer class on the official Apache Spark website.
