There are three common ways to drop duplicate rows from a PySpark DataFrame:
Method 1: Drop Rows with Duplicate Values Across All Columns
#drop rows that have duplicate values across all columns
df_new = df.dropDuplicates()
Method 2: Drop Rows with Duplicate Values Across Specific Columns
#drop rows that have duplicate values across 'team' and 'position' columns
df_new = df.dropDuplicates(['team', 'position'])
Method 3: Drop Rows with Duplicate Values in One Specific Column
#drop rows that have duplicate values in 'team' column
df_new = df.dropDuplicates(['team'])
The following examples show how to use each method in practice with the following PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'Guard', 11],
['A', 'Guard', 8],
['A', 'Forward', 22],
['A', 'Forward', 22],
['B', 'Guard', 14],
['B', 'Guard', 14],
['B', 'Forward', 13],
['B', 'Forward', 7]]
#define column names
columns = ['team', 'position', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+--------+------+
|team|position|points|
+----+--------+------+
| A| Guard| 11|
| A| Guard| 8|
| A| Forward| 22|
| A| Forward| 22|
| B| Guard| 14|
| B| Guard| 14|
| B| Forward| 13|
| B| Forward| 7|
+----+--------+------+
Example 1: Drop Rows with Duplicate Values Across All Columns
We can use the following syntax to drop rows that have duplicate values across all columns in the DataFrame:
#drop rows that have duplicate values across all columns
df_new = df.dropDuplicates()
#view DataFrame without duplicates
df_new.show()
+----+--------+------+
|team|position|points|
+----+--------+------+
| A| Guard| 11|
| A| Guard| 8|
| A| Forward| 22|
| B| Guard| 14|
| B| Forward| 13|
| B| Forward| 7|
+----+--------+------+
A total of two rows were dropped from the DataFrame.
Example 2: Drop Rows with Duplicate Values Across Specific Columns
We can use the following syntax to drop rows that have duplicate values across the team and position columns in the DataFrame:
#drop rows that have duplicate values across 'team' and 'position' columns
df_new = df.dropDuplicates(['team', 'position'])
#view DataFrame without duplicates
df_new.show()
+----+--------+------+
|team|position|points|
+----+--------+------+
| A| Guard| 11|
| A| Forward| 22|
| B| Guard| 14|
| B| Forward| 13|
+----+--------+------+
Notice that the resulting DataFrame has no rows with duplicate values across both the team and position columns.
Example 3: Drop Rows with Duplicate Values in One Specific Column
We can use the following syntax to drop rows that have duplicate values in the team column of the DataFrame:
#drop rows that have duplicate values in 'team' column
df_new = df.dropDuplicates(['team'])
#view DataFrame without duplicates
df_new.show()
+----+--------+------+
|team|position|points|
+----+--------+------+
| A| Guard| 11|
| B| Guard| 14|
+----+--------+------+
Notice that the resulting DataFrame has no rows with duplicate values in the team column.
Note: When duplicate rows are identified, only one row from each set of duplicates is kept in the DataFrame while all other duplicate rows are dropped. Because Spark processes data in parallel across partitions, which row survives is not guaranteed unless you control it explicitly.