How can I use the cast() function in PySpark to convert data types for multiple columns at once?

The cast() function in PySpark converts a column to a different data type. While cast() itself operates on one column at a time, you can apply it to several columns at once by looping over a list of column names (or by building a single select expression), which eliminates the need to write a separate conversion for each column. This makes it easy to transform your data to fit your specific needs.

PySpark: Use cast() with Multiple Columns


You can use the PySpark cast() function to convert a column to a specific data type.

To use cast() with multiple columns at once, you can use the following syntax:

from pyspark.sql.functions import col

my_cols = ['points', 'assists']

for x in my_cols:
    df = df.withColumn(x, col(x).cast('string'))

This particular example casts both the points and assists columns in the DataFrame to strings, while leaving the data type of all other columns in the DataFrame unchanged.

The following example shows how to use this syntax in practice.

Example: How to Use cast() with Multiple Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

We can use the following syntax to display the data type of each column in the DataFrame:

#check data type of each column
df.dtypes

[('team', 'string'),
 ('conference', 'string'),
 ('points', 'bigint'),
 ('assists', 'bigint')]

We can see that both the points and assists columns currently have a data type of bigint (a 64-bit integer).

We can use the following syntax to convert the data type of both of these columns to string:

from pyspark.sql.functions import col

#specify columns to convert to a different data type
my_cols = ['points', 'assists']

#convert data type of each column in list to string
for x in my_cols:
    df = df.withColumn(x, col(x).cast('string'))

#view DataFrame
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

We can use the dtypes attribute once again to view the data types of each column in the DataFrame:

#check data type of each column
df.dtypes

[('team', 'string'),
 ('conference', 'string'),
 ('points', 'string'),
 ('assists', 'string')]

We can see that the points and assists columns have both been converted to a data type of string.

Also note that the data type of the team and conference columns has remained unchanged.
