How can I create a duplicate column in a PySpark DataFrame?

To create a duplicate column in a PySpark DataFrame, use the `withColumn()` method and pass it the name of the new column and the column you want to duplicate. Because PySpark DataFrames are immutable, this returns a new DataFrame with an extra column that holds the same values as the original column. This is useful when you want to transform a column while keeping its original values available, or when you want a backup copy of a column.

Create a Duplicate Column in PySpark DataFrame


You can use the following basic syntax to create a duplicate column in a PySpark DataFrame:

df_new = df.withColumn('my_duplicate_column', df['original_column'])

The following example shows how to use this syntax in practice.

Example: How to Create Duplicate Column in PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 5],
        ['A', 'Guard', 8, 4],
        ['A', 'Forward', 22, 3],
        ['A', 'Forward', 22, 6],
        ['B', 'Guard', 14, 3],
        ['B', 'Guard', 14, 5],
        ['B', 'Forward', 13, 7],
        ['B', 'Forward', 14, 8],
        ['C', 'Forward', 23, 2],
        ['C', 'Guard', 30, 5]]
  
#define column names
columns = ['team', 'position', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      5|
|   A|   Guard|     8|      4|
|   A| Forward|    22|      3|
|   A| Forward|    22|      6|
|   B|   Guard|    14|      3|
|   B|   Guard|    14|      5|
|   B| Forward|    13|      7|
|   B| Forward|    14|      8|
|   C| Forward|    23|      2|
|   C|   Guard|    30|      5|
+----+--------+------+-------+

We can use the following code to create a duplicate of the points column and name it points_duplicate:

#create duplicate of 'points' column
df_new = df.withColumn('points_duplicate', df['points'])

#view new DataFrame
df_new.show()

+----+--------+------+-------+----------------+
|team|position|points|assists|points_duplicate|
+----+--------+------+-------+----------------+
|   A|   Guard|    11|      5|              11|
|   A|   Guard|     8|      4|               8|
|   A| Forward|    22|      3|              22|
|   A| Forward|    22|      6|              22|
|   B|   Guard|    14|      3|              14|
|   B|   Guard|    14|      5|              14|
|   B| Forward|    13|      7|              13|
|   B| Forward|    14|      8|              14|
|   C| Forward|    23|      2|              23|
|   C|   Guard|    30|      5|              30|
+----+--------+------+-------+----------------+

Notice that the points_duplicate column contains the exact same values as the points column.

Note that the duplicate column must have a different name than the original column, or else a duplicate column will not be created.

For example, the following code runs without raising an error, but it does not add a new column:

#attempt to create duplicate points column
df_new = df.withColumn('points', df['points'])

#view new DataFrame
df_new.show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      5|
|   A|   Guard|     8|      4|
|   A| Forward|    22|      3|
|   A| Forward|    22|      6|
|   B|   Guard|    14|      3|
|   B|   Guard|    14|      5|
|   B| Forward|    13|      7|
|   B| Forward|    14|      8|
|   C| Forward|    23|      2|
|   C|   Guard|    30|      5|
+----+--------+------+-------+

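No duplicate column was created.

Because the name points already exists in the DataFrame, withColumn() replaced the existing column (here, with identical values) instead of adding a new one. The duplicate column must have a different name than the original column.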

Note: You can find the complete documentation for the PySpark withColumn function in the official PySpark API reference.
