In PySpark, columns can be renamed with the `withColumnRenamed()` function. It takes two arguments, the existing column name and the new column name, and returns a new DataFrame with the column renamed. For example, if we have a DataFrame named `df` with columns "id" and "name", we can rename the "name" column to "full_name" with `df.withColumnRenamed("name", "full_name")`. This is useful when working with large datasets where you want column names to be more descriptive or standardized.
You can use the following methods to rename columns in a PySpark DataFrame:
Method 1: Rename One Column
```python
# rename 'conference' column to 'conf'
df = df.withColumnRenamed('conference', 'conf')
```
Method 2: Rename Multiple Columns
```python
# rename 'conference' and 'team' columns
df = df.withColumnRenamed('conference', 'conf') \
       .withColumnRenamed('team', 'team_name')
```
Method 3: Rename All Columns
```python
# specify new column names to use
col_names = ['the_team', 'the_conf', 'points_scored', 'total_assists']

# rename all columns with the new names
df = df.toDF(*col_names)
```
The following examples show how to use each of these methods in practice with the following PySpark DataFrame:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# define data
data = [['A', 'East', 11, 4],
        ['A', 'East', 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', 6, 12],
        ['B', 'West', 6, 4],
        ['C', 'East', 5, 2]]

# define column names
columns = ['team', 'conference', 'points', 'assists']

# create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

# view DataFrame
df.show()
```

```
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+
```
Example 1: Rename One Column in PySpark
We can use the following syntax to rename just the conference column in the DataFrame:
```python
# rename 'conference' column to 'conf'
df = df.withColumnRenamed('conference', 'conf')

# view updated DataFrame
df.show()
```

```
+----+----+------+-------+
|team|conf|points|assists|
+----+----+------+-------+
|   A|East|    11|      4|
|   A|East|     8|      9|
|   A|East|    10|      3|
|   B|West|     6|     12|
|   B|West|     6|      4|
|   C|East|     5|      2|
+----+----+------+-------+
```
Notice that only the conference column has been renamed.
Example 2: Rename Multiple Columns in PySpark
We can use the following syntax to rename the conference and team columns in the DataFrame:
```python
# rename 'conference' and 'team' columns
df = df.withColumnRenamed('conference', 'conf') \
       .withColumnRenamed('team', 'team_name')

# view updated DataFrame
df.show()
```

```
+---------+----+------+-------+
|team_name|conf|points|assists|
+---------+----+------+-------+
|        A|East|    11|      4|
|        A|East|     8|      9|
|        A|East|    10|      3|
|        B|West|     6|     12|
|        B|West|     6|      4|
|        C|East|     5|      2|
+---------+----+------+-------+
```
Notice that the conference and team columns have been renamed while all other column names have remained the same.
Example 3: Rename All Columns in PySpark
We can use the following syntax to rename all columns in the DataFrame:
```python
# specify new column names to use
col_names = ['the_team', 'the_conf', 'points_scored', 'total_assists']

# rename all columns with the new names
df = df.toDF(*col_names)

# view updated DataFrame
df.show()
```

```
+--------+--------+-------------+-------------+
|the_team|the_conf|points_scored|total_assists|
+--------+--------+-------------+-------------+
|       A|    East|           11|            4|
|       A|    East|            8|            9|
|       A|    East|           10|            3|
|       B|    West|            6|           12|
|       B|    West|            6|            4|
|       C|    East|            5|            2|
+--------+--------+-------------+-------------+
```
Notice that all of the column names have been renamed based on the new names that we specified.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark: