One way to get the last row from a PySpark DataFrame is the take() method: df.take(df.count())[-1] returns the last row as a Row object. Note that this collects every row to the driver, so it is only practical for small DataFrames.

You can use the following syntax to get the last row from a PySpark DataFrame:

from pyspark.sql.functions import *

#get last row of DataFrame
last_row = (df.withColumn('id', monotonically_increasing_id())
              .select(max(struct('id', *df.columns))
              .alias('x')).select(col('x.*')).drop('id'))

The following example shows how to use this syntax in practice.

Example: How to Get Last Row from PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about basketball players on various teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Suppose we would like to get the last row from the DataFrame.

We can use the following syntax to do so:

from pyspark.sql.functions import *

#get last row of DataFrame
last_row = (df.withColumn('id', monotonically_increasing_id())
              .select(max(struct('id', *df.columns))
              .alias('x')).select(col('x.*')).drop('id'))

#view last row
last_row.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   C|      East|     5|      2|
+----+----------+------+-------+

We have successfully extracted only the last row from the DataFrame.

Here is how this syntax worked in a nutshell:

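First, we used the monotonically_increasing_id function to add a new column called id. The generated values are guaranteed to be unique and monotonically increasing, though not necessarily consecutive.
Next, we packed id and all of the original columns into a struct and used the max function on it. Structs are compared field by field starting with id, so this selects the row with the largest id value, which is the last row of the DataFrame.
Lastly, we expanded the struct back into separate columns and dropped the id column from the DataFrame.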

The end result is that we were able to get only the last row from the DataFrame.

Note: You can find the complete documentation for the monotonically_increasing_id function in the PySpark API reference.
