How to convert a column from boolean to integer in PySpark?

In PySpark, you can convert a column from Boolean to integer either by calling the .cast() method with the data type 'integer' or by using the when() and otherwise() functions. Both approaches convert true to 1 and false to 0. The .cast() method also works in the other direction: casting an integer column to 'boolean' converts 0 to false and any other non-null value to true.


The following syntax uses when() and otherwise() to convert a column from a Boolean to an integer in PySpark:

from pyspark.sql.functions import when

#convert Boolean column to integer column
df_new = df.withColumn('int_column', when(df.bool_column==True, 1).otherwise(0))

This particular example converts the Boolean column named bool_column to an integer column named int_column.

Each of the values equal to True in the Boolean column will be shown as 1 in the integer column.

Every other value in the Boolean column, which includes False and any null values, will be shown as 0 in the integer column.

The following example shows how to use this syntax in practice.

Example: Convert Boolean Column to Integer in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18, True], 
        ['Nets', 33, True], 
        ['Lakers', 12, False], 
        ['Kings', 15, True], 
        ['Hawks', 19, False],
        ['Wizards', 24, False],
        ['Magic', 28, True],
        ['Jazz', 40, False],
        ['Thunder', 24, False],
        ['Spurs', 13, True]]
  
#define column names
columns = ['team', 'points', 'playoffs'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+--------+
|   team|points|playoffs|
+-------+------+--------+
|   Mavs|    18|    true|
|   Nets|    33|    true|
| Lakers|    12|   false|
|  Kings|    15|    true|
|  Hawks|    19|   false|
|Wizards|    24|   false|
|  Magic|    28|    true|
|   Jazz|    40|   false|
|Thunder|    24|   false|
|  Spurs|    13|    true|
+-------+------+--------+

The playoffs column is a Boolean column that contains the values true and false to indicate whether or not each team made the playoffs.

We can use the following syntax to create a new column called playoffs_int that converts each of the Boolean values of true and false to the integer values of 1 or 0:

from pyspark.sql.functions import when

#convert Boolean column to integer column
df_new = df.withColumn('playoffs_int', when(df.playoffs==True, 1).otherwise(0))

#view new DataFrame
df_new.show()

+-------+------+--------+------------+
|   team|points|playoffs|playoffs_int|
+-------+------+--------+------------+
|   Mavs|    18|    true|           1|
|   Nets|    33|    true|           1|
| Lakers|    12|   false|           0|
|  Kings|    15|    true|           1|
|  Hawks|    19|   false|           0|
|Wizards|    24|   false|           0|
|  Magic|    28|    true|           1|
|   Jazz|    40|   false|           0|
|Thunder|    24|   false|           0|
|  Spurs|    13|    true|           1|
+-------+------+--------+------------+

The new playoffs_int column now displays all true and false values from the playoffs column as either 1 or 0.

We can use the dtypes attribute to view the data type of each column in df_new and verify that the new column is indeed an integer column:

#display data type of each column
df_new.dtypes

[('team', 'string'),
 ('points', 'bigint'),
 ('playoffs', 'boolean'),
 ('playoffs_int', 'int')]

The output shows that playoffs_int has the data type int, which confirms that it is an integer column.
