How can I convert a column from boolean to integer in PySpark?

You can convert a boolean column to an integer column in PySpark with the `cast()` method of a column, which changes the column's data type. Passing `IntegerType()` (or the type name `'int'`) as the target type converts false to 0 and true to 1. The examples in this article use an alternative, `when()` combined with `otherwise()`, which maps true to 1 and every other value to 0.

PySpark: Convert Column from Boolean to Integer


You can use the following syntax to convert a column from a Boolean to an integer in PySpark:

from pyspark.sql.functions import when

#convert Boolean column to integer column
df_new = df.withColumn('int_column', when(df.bool_column==True, 1).otherwise(0))

This particular example converts the Boolean column named bool_column to an integer column named int_column.

Each of the values equal to True in the Boolean column will be shown as 1 in the integer column.

Similarly, each of the values equal to False in the Boolean column will be shown as 0 in the integer column.

The following example shows how to use this syntax in practice.

Example: Convert Boolean Column to Integer in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18, True], 
        ['Nets', 33, True], 
        ['Lakers', 12, False], 
        ['Kings', 15, True], 
        ['Hawks', 19, False],
        ['Wizards', 24, False],
        ['Magic', 28, True],
        ['Jazz', 40, False],
        ['Thunder', 24, False],
        ['Spurs', 13, True]]
  
#define column names
columns = ['team', 'points', 'playoffs'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+--------+
|   team|points|playoffs|
+-------+------+--------+
|   Mavs|    18|    true|
|   Nets|    33|    true|
| Lakers|    12|   false|
|  Kings|    15|    true|
|  Hawks|    19|   false|
|Wizards|    24|   false|
|  Magic|    28|    true|
|   Jazz|    40|   false|
|Thunder|    24|   false|
|  Spurs|    13|    true|
+-------+------+--------+

The playoffs column is a Boolean column that contains the values true and false to indicate whether or not each team made the playoffs.

We can use the following syntax to create a new column called playoffs_int that converts the Boolean values true and false to the integer values 1 and 0, respectively:

from pyspark.sql.functions import when

#convert Boolean column to integer column
df_new = df.withColumn('playoffs_int', when(df.playoffs==True, 1).otherwise(0))

#view new DataFrame
df_new.show()

+-------+------+--------+------------+
|   team|points|playoffs|playoffs_int|
+-------+------+--------+------------+
|   Mavs|    18|    true|           1|
|   Nets|    33|    true|           1|
| Lakers|    12|   false|           0|
|  Kings|    15|    true|           1|
|  Hawks|    19|   false|           0|
|Wizards|    24|   false|           0|
|  Magic|    28|    true|           1|
|   Jazz|    40|   false|           0|
|Thunder|    24|   false|           0|
|  Spurs|    13|    true|           1|
+-------+------+--------+------------+

The new playoffs_int column now displays all true and false values from the playoffs column as either 1 or 0.

We can use the dtypes attribute to view the data type of each column in this new DataFrame and verify that the new column is indeed an integer column:

#display data type of each column
df_new.dtypes

[('team', 'string'),
 ('points', 'bigint'),
 ('playoffs', 'boolean'),
 ('playoffs_int', 'int')]

We can see that the new playoffs_int column is indeed an integer column.
