What is the process for creating a Boolean column in PySpark based on a condition?

In PySpark, a comparison on a column (for example, df.points > 20) already evaluates to a Boolean column, so the simplest way to create a Boolean column from a condition is to pass that expression to the “withColumn” function. The same expression can instead be added to an existing DataFrame through the “select” function, and the “when” function combined with “otherwise” is a more explicit alternative that spells out the value assigned in each case. In every case, each row of the resulting column holds true or false according to the specified condition.

You can use the following syntax to create a boolean column based on a condition in a PySpark DataFrame:

df_new = df.withColumn('good_player', df.points>20)

This particular example creates a boolean column named good_player that returns one of two values:

  • true if the value in the points column is greater than 20.
  • false if the value in the points column is not greater than 20.

The following example shows how to use this syntax in practice.

Example: Create Boolean Column Based on Condition in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]
#define column names
columns = ['team', 'points'] 
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+

Suppose we would like to create a new boolean column that contains true if the corresponding value in the points column is greater than 20 and false otherwise.

We can use the following syntax to do so:

#create boolean column based on value in points column
df_new = df.withColumn('good_player', df.points>20)

#view new DataFrame
df_new.show()

+-------+------+-----------+
|   team|points|good_player|
+-------+------+-----------+
|   Mavs|    18|      false|
|   Nets|    33|       true|
| Lakers|    12|      false|
|  Kings|    15|      false|
|  Hawks|    19|      false|
|Wizards|    24|       true|
|  Magic|    28|       true|
|   Jazz|    40|       true|
|Thunder|    24|       true|
|  Spurs|    13|      false|
+-------+------+-----------+

The new good_player column contains either true or false based on the value in the points column.

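Note: The withColumn function returns a new DataFrame in which the named column is added, or replaced if a column with that name already exists. All other columns are left the same and the original DataFrame is not changed.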

