PySpark: Drop Duplicate Rows from DataFrame


There are three common ways to drop duplicate rows from a PySpark DataFrame:

Method 1: Drop Rows with Duplicate Values Across All Columns

#drop rows that have duplicate values across all columns
df_new = df.dropDuplicates()

Method 2: Drop Rows with Duplicate Values Across Specific Columns

#drop rows that have duplicate values across 'team' and 'position' columns
df_new = df.dropDuplicates(['team', 'position'])

Method 3: Drop Rows with Duplicate Values in One Specific Column

#drop rows that have duplicate values in 'team' column
df_new = df.dropDuplicates(['team'])

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

Example 1: Drop Rows with Duplicate Values Across All Columns

We can use the following syntax to drop rows that have duplicate values across all columns in the DataFrame:

#drop rows that have duplicate values across all columns
df_new = df.dropDuplicates()

#view DataFrame without duplicates
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

A total of two rows were dropped from the DataFrame. Note that df.distinct() returns the same result as df.dropDuplicates() with no arguments.

Example 2: Drop Rows with Duplicate Values Across Specific Columns

We can use the following syntax to drop rows that have duplicate values across the team and position columns in the DataFrame:

#drop rows that have duplicate values across 'team' and 'position' columns
df_new = df.dropDuplicates(['team', 'position'])

#view DataFrame without duplicates
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B| Forward|    13|
+----+--------+------+

Notice that the resulting DataFrame has no rows with duplicate values across both the team and position columns.

Example 3: Drop Rows with Duplicate Values in One Specific Column

We can use the following syntax to drop rows that have duplicate values in the team column of the DataFrame:

#drop rows that have duplicate values in 'team' column
df_new = df.dropDuplicates(['team'])

#view DataFrame without duplicates
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   B|   Guard|    14|
+----+--------+------+

Notice that the resulting DataFrame has no rows with duplicate values in the team column.

Note: When duplicate rows are identified, only one of them is kept in the DataFrame while all other duplicates are dropped. Because Spark DataFrames are distributed and unordered, there is no guarantee about which of the duplicate rows survives unless you explicitly control it.
