To create a boolean column in PySpark based on a condition, you can pass a comparison expression such as df.points > 20 to the withColumn method; the comparison itself evaluates to a column of boolean values. The when function from the pyspark.sql.functions module is an alternative when you need explicit control over the value assigned in each case. A boolean column created this way is useful for filtering and categorizing rows in a PySpark DataFrame based on specific criteria.
PySpark: Create Boolean Column Based on Condition
You can use the following syntax to create a boolean column based on a condition in a PySpark DataFrame:
df_new = df.withColumn('good_player', df.points>20)
This particular example creates a boolean column named good_player that returns one of two values:
- true if the value in the points column is greater than 20.
- false if the value in the points column is 20 or less.
Note that if the value in the points column is null, the comparison produces null rather than false.
The following example shows how to use this syntax in practice.
Example: Create Boolean Column Based on Condition in PySpark
Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18],
        ['Nets', 33],
        ['Lakers', 12],
        ['Kings', 15],
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]

#define column names
columns = ['team', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+
Suppose we would like to create a new boolean column that contains true if the corresponding value in the points column is greater than 20 or false otherwise.
We can use the following syntax to do so:
#create boolean column based on value in points column
df_new = df.withColumn('good_player', df.points>20)

#view new DataFrame
df_new.show()

+-------+------+-----------+
|   team|points|good_player|
+-------+------+-----------+
|   Mavs|    18|      false|
|   Nets|    33|       true|
| Lakers|    12|      false|
|  Kings|    15|      false|
|  Hawks|    19|      false|
|Wizards|    24|       true|
|  Magic|    28|       true|
|   Jazz|    40|       true|
|Thunder|    24|       true|
|  Spurs|    13|      false|
+-------+------+-----------+
The new good_player column returns either true or false based on the value in the points column.
Note: The withColumn function returns a new DataFrame that adds the specified column (or replaces an existing column with the same name) and leaves all other columns unchanged. The original DataFrame is not modified.