How to Create a Duplicate Column in PySpark DataFrame?

In a PySpark DataFrame, you can use the withColumn() method to create a duplicate of an existing column. The method takes two arguments: the name of the column to create and a Column expression that supplies its values. To duplicate a column, pass a reference to the existing column as the expression. The same method also accepts expressions that transform existing columns. Because DataFrames are immutable, withColumn() returns a new DataFrame and leaves the original unchanged.


You can use the following basic syntax to create a duplicate column in a PySpark DataFrame:

df_new = df.withColumn('my_duplicate_column', df['original_column'])

The following example shows how to use this syntax in practice.

Example: How to Create Duplicate Column in PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 5],
        ['A', 'Guard', 8, 4],
        ['A', 'Forward', 22, 3],
        ['A', 'Forward', 22, 6],
        ['B', 'Guard', 14, 3],
        ['B', 'Guard', 14, 5],
        ['B', 'Forward', 13, 7],
        ['B', 'Forward', 14, 8],
        ['C', 'Forward', 23, 2],
        ['C', 'Guard', 30, 5]]
  
#define column names
columns = ['team', 'position', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      5|
|   A|   Guard|     8|      4|
|   A| Forward|    22|      3|
|   A| Forward|    22|      6|
|   B|   Guard|    14|      3|
|   B|   Guard|    14|      5|
|   B| Forward|    13|      7|
|   B| Forward|    14|      8|
|   C| Forward|    23|      2|
|   C|   Guard|    30|      5|
+----+--------+------+-------+

We can use the following code to create a duplicate of the points column and name it points_duplicate:

#create duplicate of 'points' column
df_new = df.withColumn('points_duplicate', df['points'])

#view new DataFrame
df_new.show()

+----+--------+------+-------+----------------+
|team|position|points|assists|points_duplicate|
+----+--------+------+-------+----------------+
|   A|   Guard|    11|      5|              11|
|   A|   Guard|     8|      4|               8|
|   A| Forward|    22|      3|              22|
|   A| Forward|    22|      6|              22|
|   B|   Guard|    14|      3|              14|
|   B|   Guard|    14|      5|              14|
|   B| Forward|    13|      7|              13|
|   B| Forward|    14|      8|              14|
|   C| Forward|    23|      2|              23|
|   C|   Guard|    30|      5|              30|
+----+--------+------+-------+----------------+

Notice that the points_duplicate column contains the exact same values as the points column.

Note that the duplicate column must have a different name than the original column. If you pass the name of an existing column, withColumn() replaces that column instead of adding a new one.

For example, if we attempt to use the following code to create a duplicate column, it won’t work:

#attempt to create duplicate points column
df_new = df.withColumn('points', df['points'])

#view new DataFrame
df_new.show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      5|
|   A|   Guard|     8|      4|
|   A| Forward|    22|      3|
|   A| Forward|    22|      6|
|   B|   Guard|    14|      3|
|   B|   Guard|    14|      5|
|   B| Forward|    13|      7|
|   B| Forward|    14|      8|
|   C| Forward|    23|      2|
|   C|   Guard|    30|      5|
+----+--------+------+-------+

No duplicate column was created.

Because the name points already exists, withColumn() overwrote that column with identical values, so to create a duplicate you must use a new column name.

Note: You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.
