How can I create a boolean column in PySpark based on a condition?

To create a boolean column in PySpark based on a condition, you can pass a comparison on an existing column to withColumn; the comparison itself evaluates to a boolean column. The “when” function from the “pyspark.sql.functions” module can express the same condition when you want to assign the values explicitly. Boolean columns built this way are useful for filtering and categorizing rows in a PySpark DataFrame based on specific criteria.

PySpark: Create Boolean Column Based on Condition

You can use the following syntax to create a boolean column based on a condition in a PySpark DataFrame:

df_new = df.withColumn('good_player', df.points > 20)

This particular example creates a boolean column named good_player that contains one of two values:

  • true if the value in the points column is greater than 20.
  • false if the value in the points column is not greater than 20.

The following example shows how to use this syntax in practice.

Example: Create Boolean Column Based on Condition in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]
#define column names
columns = ['team', 'points'] 
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+

Suppose we would like to create a new boolean column that contains true if the corresponding value in the points column is greater than 20 and false otherwise.

We can use the following syntax to do so:

#create boolean column based on value in points column
df_new = df.withColumn('good_player', df.points > 20)

#view new DataFrame
df_new.show()

+-------+------+-----------+
|   team|points|good_player|
+-------+------+-----------+
|   Mavs|    18|      false|
|   Nets|    33|       true|
| Lakers|    12|      false|
|  Kings|    15|      false|
|  Hawks|    19|      false|
|Wizards|    24|       true|
|  Magic|    28|       true|
|   Jazz|    40|       true|
|Thunder|    24|       true|
|  Spurs|    13|      false|
+-------+------+-----------+

The new good_player column contains either true or false based on the value in the points column.

Note: The withColumn method returns a new DataFrame with the specified column added (or replaced, if a column with that name already exists) and all other columns left unchanged.

