How do I select the first row of each group from a PySpark DataFrame?


You can use the following syntax to select the first row by group in a PySpark DataFrame:

from pyspark.sql.functions import row_number, lit, col
from pyspark.sql.window import Window

#partition DataFrame by team column
w = Window.partitionBy('team').orderBy(lit('A'))

#filter DataFrame to only show first row for each team
df.withColumn('row', row_number().over(w)).filter(col('row') == 1).drop('row').show()

This particular example selects the first row for each unique team value in the DataFrame.

The following example shows how to use this syntax in practice.

Example: How to Select the First Row of Each Group in a PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about basketball players on various teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 14],
        ['C', 'Forward', 23],
        ['C', 'Guard', 30]]
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|    14|
|   C| Forward|    23|
|   C|   Guard|    30|
+----+--------+------+

Suppose we would like to select the first row for each unique team.

We can use the following syntax to do so:

from pyspark.sql.functions import row_number, lit, col
from pyspark.sql.window import Window

#partition DataFrame by team column
w = Window.partitionBy('team').orderBy(lit('A'))

#filter DataFrame to only show first row for each team
df.withColumn('row', row_number().over(w)).filter(col('row') == 1).drop('row').show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   B|   Guard|    14|
|   C| Forward|    23|
+----+--------+------+

The resulting DataFrame shows only the first row for each unique team value.

Note #1: If you would like to group by multiple columns, simply include multiple column names in the partitionBy function.
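
For instance, here is a minimal sketch of how you could select the first row for each unique combination of team and position, assuming the same df from the example above:

from pyspark.sql.functions import row_number, lit, col
from pyspark.sql.window import Window

#partition DataFrame by both team and position columns
w = Window.partitionBy('team', 'position').orderBy(lit('A'))

#filter DataFrame to only show first row for each team/position combination
df.withColumn('row', row_number().over(w)).filter(col('row') == 1).drop('row').show()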

Note #2: The syntax lit('A') simply supplies an arbitrary constant value to order by. You can replace 'A' with any value you'd like and the code will still work.
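
Also keep in mind that because lit('A') orders every row by the same constant, Spark makes no guarantee about which row within each team is returned first. If you need a deterministic result, one option is to order by an actual column instead. For example, here is a sketch that selects the highest-scoring row for each team:

from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window

#order rows within each team by points in descending order
w = Window.partitionBy('team').orderBy(col('points').desc())

#keep only the row with the most points for each team
df.withColumn('row', row_number().over(w)).filter(col('row') == 1).drop('row').show()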
