How can I use PySpark to find the unique values in a specific column of a dataframe?

PySpark makes it straightforward to find the unique values in a specific column of a DataFrame, even when working with large datasets. The key is the distinct() function, which removes duplicate rows and returns only the unique ones. The examples below show how to extract, sort, and count the distinct values in a column.

PySpark: Find Unique Values in a Column


The easiest way to obtain the unique values in a PySpark DataFrame column is to use the distinct() function.

This tutorial provides several examples of how to use this function with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create DataFrame using data and column names
df = spark.createDataFrame(data, columns) 
  
#view DataFrame
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+

Example 1: Find Unique Values in a Column

We can use the following syntax to find the unique values in the team column of the DataFrame:

df.select('team').distinct().show()

+----+
|team|
+----+
|   A|
|   B|
|   C|
+----+

We can see that the unique values in the team column are A, B and C.

Example 2: Find and Sort Unique Values in a Column

Suppose we used the following syntax to find the unique values in the points column:

df.select('points').distinct().show()

+------+
|points|
+------+
|    11|
|     8|
|    10|
|     6|
|     5|
+------+

The output displays the unique values but they aren’t sorted in any way.

If we’d like, we can use the following syntax to find the unique values in the points column and return them sorted in ascending order:

#find unique values in points column
df_points = df.select('points').distinct()

#display unique values in ascending order
df_points.orderBy('points').show()

+------+
|points|
+------+
|     5|
|     6|
|     8|
|    10|
|    11|
+------+

We can also use the argument ascending=False to return the unique values in descending order instead:

#find unique values in points column
df_points = df.select('points').distinct()

#display unique values in descending order
df_points.orderBy('points', ascending=False).show()

+------+
|points|
+------+
|    11|
|    10|
|     8|
|     6|
|     5|
+------+

Example 3: Find and Count Unique Values in a Column

The following code shows how to count the occurrences of each unique value in the team column of the DataFrame:

df.groupBy('team').count().show()

+----+-----+
|team|count|
+----+-----+
|   A|    3|
|   B|    2|
|   C|    1|
+----+-----+

From the output we can see the three unique values (A, B, C) along with the number of times each unique value occurs.
