How can I convert a string to an integer in PySpark, and can you provide an example?

To convert a string to an integer in PySpark, use the cast() method of a column. It is called on the string column and takes the target data type as its argument. For example, to convert a string column "age" to an integer column, the code would be: df = df.withColumn('age', df['age'].cast('integer')). This converts all the values in the "age" column from string to integer.

Convert String to Integer in PySpark (With Example)


You can use the following syntax to convert a string column to an integer column in a PySpark DataFrame:

from pyspark.sql.types import IntegerType

df = df.withColumn('my_integer', df['my_string'].cast(IntegerType()))

This particular example creates a new column called my_integer that contains the integer values from the string values in the my_string column.

The following example shows how to use this syntax in practice.

Example: How to Convert String to Integer in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '11'], 
        ['B', '19'], 
        ['C', '22'], 
        ['D', '25'], 
        ['E', '12'], 
        ['F', '41'],
        ['G', '32'],
        ['H', '20']] 
  
#define column names
columns = ['team', 'points']
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   B|    19|
|   C|    22|
|   D|    25|
|   E|    12|
|   F|    41|
|   G|    32|
|   H|    20|
+----+------+

We can use the following syntax to display the data type of each column in the DataFrame:

#check data type of each column
df.dtypes

[('team', 'string'), ('points', 'string')]

We can see that the points column currently has a data type of string.

To convert this column from a string to an integer, we can use the following syntax:

from pyspark.sql.types import IntegerType

#create integer column from string column
df = df.withColumn('points_integer', df['points'].cast(IntegerType()))

#view updated DataFrame
df.show()

+----+------+--------------+
|team|points|points_integer|
+----+------+--------------+
|   A|    11|            11|
|   B|    19|            19|
|   C|    22|            22|
|   D|    25|            25|
|   E|    12|            12|
|   F|    41|            41|
|   G|    32|            32|
|   H|    20|            20|
+----+------+--------------+

We can use the dtypes attribute once again to view the data type of each column in the DataFrame:

#check data type of each column
df.dtypes

[('team', 'string'), ('points', 'string'), ('points_integer', 'int')]

We can see that the points_integer column has a data type of int.

We have successfully created an integer column from a string column.
