How can I add a new column with row numbers in PySpark?

To add a new column with row numbers in PySpark, you can use the built-in function “monotonically_increasing_id()” which assigns a unique and monotonically increasing number to each row in the DataFrame. This function can be used to create a new column containing row numbers, allowing for easy identification and manipulation of specific rows in the DataFrame.

PySpark: Add New Column with Row Numbers


You can use the following syntax to add a new column with row numbers to a PySpark DataFrame:

from pyspark.sql.functions import row_number,lit
from pyspark.sql.window import Window

#add column called 'id' that contains row numbers from 1 to n
w = Window().orderBy(lit('A'))
df = df.withColumn('id', row_number().over(w))

This particular example will add a new column called id that contains row numbers ranging from 1 to n for the DataFrame.

The following example shows how to do so in practice.

Example: How to Add New Column with Row Numbers to PySpark DataFrame

Suppose we create the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create DataFrame using data and column names
df = spark.createDataFrame(data, columns) 
  
#view DataFrame
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+

The DataFrame currently contains three columns.

Suppose we would like to add a new column that contains row numbers.

We can use the following syntax to do so:

from pyspark.sql.functions import row_number,lit
from pyspark.sql.window import Window

#add column called 'id' that contains row numbers from 1 to n
w = Window().orderBy(lit('A'))
df = df.withColumn('id', row_number().over(w))

#view updated DataFrame
df.show()

+----+----------+------+---+
|team|conference|points| id|
+----+----------+------+---+
|   A|      East|    11|  1|
|   A|      East|     8|  2|
|   A|      East|    10|  3|
|   B|      West|     6|  4|
|   B|      West|     6|  5|
|   C|      East|     5|  6|
+----+----------+------+---+

Notice that the new id column contains row numbers ranging from 1 to 6.

If you’d like, you can reorder the columns so that the id column appears at the front:

#move 'id' column to front
df = df.select('id', 'team', 'conference', 'points')#view updated DataFrame
df.show()

+---+----+----------+------+
| id|team|conference|points|
+---+----+----------+------+
|  1|   A|      East|    11|
|  2|   A|      East|     8|
|  3|   A|      East|    10|
|  4|   B|      West|     6|
|  5|   B|      West|     6|
|  6|   C|      East|     5|
+---+----+----------+------+

Note: The syntax lit(‘A’) is simply used as an arbitrary value. You can replace ‘A’ with anything you’d like and the code will still work.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

x