Table of Contents
To add a new column with row numbers in PySpark, you can use the built-in function “monotonically_increasing_id()” which assigns a unique and monotonically increasing number to each row in the DataFrame. This function can be used to create a new column containing row numbers, allowing for easy identification and manipulation of specific rows in the DataFrame.
PySpark: Add New Column with Row Numbers
You can use the following syntax to add a new column with row numbers to a PySpark DataFrame:
from pyspark.sql.functions import row_number,lit from pyspark.sql.window import Window #add column called 'id' that contains row numbers from 1 to n w = Window().orderBy(lit('A')) df = df.withColumn('id', row_number().over(w))
This particular example will add a new column called id that contains row numbers ranging from 1 to n for the DataFrame.
The following example shows how to do so in practice.
Example: How to Add New Column with Row Numbers to PySpark DataFrame
Suppose we create the following PySpark DataFrame:
from pyspark.sql import SparkSession spark = SparkSession.builder.getOrCreate() #define data data = [['A', 'East', 11], ['A', 'East', 8], ['A', 'East', 10], ['B', 'West', 6], ['B', 'West', 6], ['C', 'East', 5]] #define column names columns = ['team', 'conference', 'points'] #create DataFrame using data and column names df = spark.createDataFrame(data, columns) #view DataFrame df.show() +----+----------+------+ |team|conference|points| +----+----------+------+ | A| East| 11| | A| East| 8| | A| East| 10| | B| West| 6| | B| West| 6| | C| East| 5| +----+----------+------+
The DataFrame currently contains three columns.
Suppose we would like to add a new column that contains row numbers.
We can use the following syntax to do so:
from pyspark.sql.functions import row_number,lit from pyspark.sql.window import Window #add column called 'id' that contains row numbers from 1 to n w = Window().orderBy(lit('A')) df = df.withColumn('id', row_number().over(w)) #view updated DataFrame df.show() +----+----------+------+---+ |team|conference|points| id| +----+----------+------+---+ | A| East| 11| 1| | A| East| 8| 2| | A| East| 10| 3| | B| West| 6| 4| | B| West| 6| 5| | C| East| 5| 6| +----+----------+------+---+
Notice that the new id column contains row numbers ranging from 1 to 6.
If you’d like, you can reorder the columns so that the id column appears at the front:
#move 'id' column to front
df = df.select('id', 'team', 'conference', 'points')#view updated DataFrame
df.show()
+---+----+----------+------+
| id|team|conference|points|
+---+----+----------+------+
| 1| A| East| 11|
| 2| A| East| 8|
| 3| A| East| 10|
| 4| B| West| 6|
| 5| B| West| 6|
| 6| C| East| 5|
+---+----+----------+------+
Note: The syntax lit(‘A’) is simply used as an arbitrary value. You can replace ‘A’ with anything you’d like and the code will still work.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark: