How can I drop duplicate rows from a DataFrame in PySpark?

Dropping duplicate rows from a DataFrame in PySpark means removing rows that contain identical values, either across every column or across a subset of columns that you specify. This is done with the dropDuplicates() function, which returns a new DataFrame with the duplicate rows removed. It is a common step in data cleaning, since duplicate records can skew counts, aggregations, and other downstream analysis on large datasets.

PySpark: Drop Duplicate Rows from DataFrame


There are three common ways to drop duplicate rows from a PySpark DataFrame:

Method 1: Drop Rows with Duplicate Values Across All Columns

#drop rows that have duplicate values across all columns
df_new = df.dropDuplicates()

Method 2: Drop Rows with Duplicate Values Across Specific Columns

#drop rows that have duplicate values across 'team' and 'position' columns
df_new = df.dropDuplicates(['team', 'position'])

Method 3: Drop Rows with Duplicate Values in One Specific Column

#drop rows that have duplicate values in 'team' column
df_new = df.dropDuplicates(['team'])
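
A related note: calling dropDuplicates() with no arguments is equivalent to calling distinct(), which also removes rows that are duplicated across every column:

#equivalent to df.dropDuplicates()
df_new = df.distinct()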

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

Example 1: Drop Rows with Duplicate Values Across All Columns

We can use the following syntax to drop rows that have duplicate values across all columns in the DataFrame:

#drop rows that have duplicate values across all columns
df_new = df.dropDuplicates()

#view DataFrame without duplicates
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

A total of two rows were dropped from the DataFrame.
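
One quick way to verify how many duplicate rows were removed is to compare the row counts of the original and deduplicated DataFrames; this is just a sanity check, not part of dropDuplicates() itself:

#count how many rows were dropped
print(df.count() - df_new.count())  # 2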

Example 2: Drop Rows with Duplicate Values Across Specific Columns

We can use the following syntax to drop rows that have duplicate values across the team and position columns in the DataFrame:

#drop rows that have duplicate values across 'team' and 'position' columns
df_new = df.dropDuplicates(['team', 'position'])

#view DataFrame without duplicates
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B| Forward|    13|
+----+--------+------+

Notice that the resulting DataFrame has no rows with duplicate values across both the team and position columns.
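
Note that dropDuplicates(['team', 'position']) keeps every column of the surviving rows. If you only need the unique team and position combinations themselves and can discard the points column, one alternative sketch is to select those columns first and call distinct():

#keep only the unique team/position pairs (the points column is dropped)
df.select('team', 'position').distinct().show()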

Example 3: Drop Rows with Duplicate Values in One Specific Column

We can use the following syntax to drop rows that have duplicate values in the team column of the DataFrame:

#drop rows that have duplicate values in 'team' column
df_new = df.dropDuplicates(['team'])

#view DataFrame without duplicates
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   B|   Guard|    14|
+----+--------+------+

Notice that the resulting DataFrame has no rows with duplicate values in the team column.

Note: When duplicate rows are identified, one row from each group of duplicates is kept while the others are dropped. Because Spark evaluates DataFrames in parallel across partitions, exactly which row is kept is not guaranteed unless you impose an ordering yourself.
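
If you need deterministic control over which row survives, one common pattern is a window function. The sketch below keeps the row with the most points within each team; ordering by points descending is an assumption for illustration, and you would swap in whatever column defines the "first" row in your own data:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

#rank rows within each team, highest points first (assumed ordering)
w = Window.partitionBy('team').orderBy(col('points').desc())

#keep only the top-ranked row in each team, then drop the helper column
df_new = (df.withColumn('rn', row_number().over(w))
            .filter(col('rn') == 1)
            .drop('rn'))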
