How to use withColumn() with IF ELSE in PySpark?

The withColumn() function in PySpark adds a new column to a Spark DataFrame, and the when() and otherwise() functions supply the IF ELSE logic needed to populate that column based on the values in an existing column. This allows for more complex data manipulation and analysis on the DataFrame.


You can use the following syntax to use the withColumn() function in PySpark with IF ELSE logic:

from pyspark.sql.functions import when

#create new column that contains 'Good' or 'Bad' based on value in points column
df_new = df.withColumn('rating', when(df.points>20, 'Good').otherwise('Bad'))

This particular example creates a new column named rating that returns ‘Good’ if the value in the points column is greater than 20 and ‘Bad’ otherwise.

The following example shows how to use this syntax in practice.

Example: How to Use withColumn() with IF ELSE in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by basketball players on various teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Kings', 15], 
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+

We can use the following syntax to create a new column named rating that returns ‘Good’ if the value in the points column is greater than 20 and ‘Bad’ otherwise:

from pyspark.sql.functions import when

#create new column that contains 'Good' or 'Bad' based on value in points column
df_new = df.withColumn('rating', when(df.points>20, 'Good').otherwise('Bad'))

#view new DataFrame
df_new.show()

+-------+------+------+
|   team|points|rating|
+-------+------+------+
|   Mavs|    18|   Bad|
|   Nets|    33|  Good|
| Lakers|    12|   Bad|
|  Kings|    15|   Bad|
|  Hawks|    19|   Bad|
|Wizards|    24|  Good|
|  Magic|    28|  Good|
|   Jazz|    40|  Good|
|Thunder|    24|  Good|
|  Spurs|    13|   Bad|
+-------+------+------+

The new rating column now displays either ‘Good’ or ‘Bad’ based on the corresponding value in the points column.

For example:

  • The value of points in the first row is not greater than 20, so the rating column returns Bad.
  • The value of points in the second row is greater than 20, so the rating column returns Good.

And so on.

Note that you could also return numeric values if you’d like.

For example, you can use the following syntax to create a new column named rating that returns 1 if the value in the points column is greater than 20 and 0 otherwise:

from pyspark.sql.functions import when

#create new column that contains 1 or 0 based on value in points column
df_new = df.withColumn('rating', when(df.points>20, 1).otherwise(0))

#view new DataFrame
df_new.show()

+-------+------+------+
|   team|points|rating|
+-------+------+------+
|   Mavs|    18|     0|
|   Nets|    33|     1|
| Lakers|    12|     0|
|  Kings|    15|     0|
|  Hawks|    19|     0|
|Wizards|    24|     1|
|  Magic|    28|     1|
|   Jazz|    40|     1|
|Thunder|    24|     1|
|  Spurs|    13|     0|
+-------+------+------+

We can see that the new rating column now contains either 0 or 1.
