How can I retrieve the last row from a DataFrame in PySpark?

There are two common ways to retrieve the last row from a DataFrame in PySpark. If the DataFrame has a column that defines the row order, you can sort the DataFrame in descending order by that column with the `orderBy()` function and then use the `head()` function to retrieve the first row of the sorted result, which is the last row of the original DataFrame. If no such column exists, you can instead add an index column with `monotonically_increasing_id()` and select the row with the largest index; this is the approach shown in the tutorial below.
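
For reference, here is a minimal sketch of the sort-based approach, assuming the DataFrame has a column that defines the row order ('order_col' is a hypothetical placeholder, not a column from the example below):

from pyspark.sql.functions import col

#sort descending by the ordering column, then take the first row
#('order_col' is a hypothetical column assumed to define row order)
last_row = df.orderBy(col('order_col').desc()).head()

Note that head() returns a single Row object (or None for an empty DataFrame) rather than a one-row DataFrame.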

PySpark: Get Last Row from DataFrame

You can use the following syntax to get the last row from a PySpark DataFrame:

from pyspark.sql.functions import monotonically_increasing_id, max, struct, col

#get last row of DataFrame
last_row = (df.withColumn('id', monotonically_increasing_id())
              .select(max(struct('id', *df.columns)).alias('x'))
              .select(col('x.*'))
              .drop('id'))

The following example shows how to use this syntax in practice.

Example: How to Get Last Row from PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about basketball players on various teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Suppose we would like to get the last row from the DataFrame.

We can use the following syntax to do so:

from pyspark.sql.functions import monotonically_increasing_id, max, struct, col

#get last row of DataFrame
last_row = (df.withColumn('id', monotonically_increasing_id())
              .select(max(struct('id', *df.columns)).alias('x'))
              .select(col('x.*'))
              .drop('id'))

#view last row
last_row.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   C|      East|     5|      2|
+----+----------+------+-------+

We have successfully extracted only the last row from the DataFrame.

Here is how this syntax worked in a nutshell:

  • First, we used the monotonically_increasing_id function to add a new column called id that contains monotonically increasing (though not necessarily consecutive) values, so the last row is guaranteed to have the largest id (see the sketch after this list).
  • Next, we wrapped the id column and the original columns in a struct and used the max function. Because id is the first field in the struct, max returns the struct from the row with the largest id value, which is the last row.
  • Lastly, we expanded the struct back into individual columns and dropped the id column from the DataFrame.
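
To see what the first step produces, you can inspect the DataFrame after the id column is added. This is a quick sketch; the exact id values depend on how the data is partitioned, but they always increase with row position:

from pyspark.sql.functions import monotonically_increasing_id

#add the id column and inspect it: the values are not necessarily
#consecutive, but the last row always receives the largest id
df.withColumn('id', monotonically_increasing_id()).show()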

The end result is that we were able to get only the last row from the DataFrame.
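
Note that if you are using Spark 3.0 or later, the tail() method offers a simpler alternative. Unlike the approach above, it returns the last rows as a list of Row objects collected to the driver rather than as a DataFrame:

#get the last row as a list containing a single Row object
#(requires Spark 3.0+; collects the result to the driver)
last_row = df.tail(1)

print(last_row)
#[Row(team='C', conference='East', points=5, assists=2)]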

Note: You can find the complete documentation for the monotonically_increasing_id function in the PySpark API documentation.
