How to Get the Last Row from a PySpark DataFrame

One way to get the last row from a PySpark DataFrame is to collect every row with the take() method and index the final element: df.take(df.count())[-1]. This returns the last row as a Row object, but it pulls the entire DataFrame onto the driver, so it is only practical for small DataFrames.


You can use the following syntax to get the last row from a PySpark DataFrame:

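#note: this max is the Spark aggregate function, not Python's built-in max
from pyspark.sql.functions import col, max, monotonically_increasing_id, struct

#get last row of DataFrame
last_row = (df.withColumn('id', monotonically_increasing_id())
              .select(max(struct('id', *df.columns)).alias('x'))
              .select(col('x.*'))
              .drop('id'))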

The following example shows how to use this syntax in practice.

Example: How to Get Last Row from PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about basketball players on various teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Suppose we would like to get the last row from the DataFrame.

We can use the following syntax to do so:

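#note: this max is the Spark aggregate function, not Python's built-in max
from pyspark.sql.functions import col, max, monotonically_increasing_id, struct

#get last row of DataFrame
last_row = (df.withColumn('id', monotonically_increasing_id())
              .select(max(struct('id', *df.columns)).alias('x'))
              .select(col('x.*'))
              .drop('id'))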

#view last row
last_row.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   C|      East|     5|      2|
+----+----------+------+-------+

We have successfully extracted only the last row from the DataFrame.

Here is how this syntax worked in a nutshell:

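  • First, we used the monotonically_increasing_id function to add a new column called id that contains monotonically increasing (but not necessarily consecutive) values.
  • Next, we used the max function on a struct whose first field is id. Structs are compared field by field, so this selects the row with the largest id value, which is the last row in the DataFrame's current order.
  • Lastly, we expanded the struct back into separate columns and dropped the id column from the DataFrame.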

The end result is that we were able to get only the last row from the DataFrame.
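The steps above can be imitated in plain Python to see why the largest id picks out the last row. This is only a sketch, not Spark itself: the helper simulated_id and the two-partition split are hypothetical, and the bit layout (partition ID in the upper 31 bits, record number in the lower 33 bits) is the one described in the PySpark documentation for the current implementation.

```python
# Assumed id layout: partition ID in the upper 31 bits,
# record number within the partition in the lower 33 bits.
def simulated_id(partition_id, row_in_partition):
    return (partition_id << 33) | row_in_partition

# Hypothetical split of the six example rows across two partitions
partitions = [
    [('A', 'East', 11, 4), ('A', 'East', 8, 9), ('A', 'East', 10, 3)],
    [('B', 'West', 6, 12), ('B', 'West', 6, 4), ('C', 'East', 5, 2)],
]

# Prepend the simulated id to each row, like withColumn('id', ...)
tagged = [
    (simulated_id(p, i), *row)
    for p, rows in enumerate(partitions)
    for i, row in enumerate(rows)
]

# The ids increase but are not consecutive across partitions
ids = [t[0] for t in tagged]
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]

# Tuples compare field by field, much like the struct in the Spark query,
# so the id in the first position decides which row is the maximum
last_row = max(tagged)[1:]  # slice off the id, like drop('id')
print(last_row)  # ('C', 'East', 5, 2)
```

Because the id comes first in the tuple, the remaining fields never affect the comparison as long as the ids are unique.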

Note: You can find the complete documentation for the monotonically_increasing_id function in the official PySpark documentation.
