PySpark: Find Unique Values in a Column


The easiest way to obtain a list of unique values in a PySpark DataFrame column is to use the distinct function.

This tutorial provides several examples of how to use this function with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create DataFrame using data and column names
df = spark.createDataFrame(data, columns) 
  
#view DataFrame
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+

Example 1: Find Unique Values in a Column

We can use the following syntax to find the unique values in the team column of the DataFrame:

df.select('team').distinct().show()

+----+
|team|
+----+
|   A|
|   B|
|   C|
+----+

We can see that the unique values in the team column are A, B and C.

Example 2: Find and Sort Unique Values in a Column

Suppose we used the following syntax to find the unique values in the points column:

df.select('points').distinct().show()

+------+
|points|
+------+
|    11|
|     8|
|    10|
|     6|
|     5|
+------+

The output displays the unique values but they aren’t sorted in any way.

If we’d like, we can use the following syntax to find the unique values in the points column and return them sorted in ascending order:

#find unique values in points column
df_points = df.select('points').distinct()

#display unique values in ascending order
df_points.orderBy('points').show()

+------+
|points|
+------+
|     5|
|     6|
|     8|
|    10|
|    11|
+------+

We can also use the argument ascending=False to return the unique values in descending order instead:

#find unique values in points column
df_points = df.select('points').distinct()

#display unique values in descending order
df_points.orderBy('points', ascending=False).show()

+------+
|points|
+------+
|    11|
|    10|
|     8|
|     6|
|     5|
+------+

Example 3: Find and Count Unique Values in a Column

The following code shows how to find and count the occurrence of unique values in the team column of the DataFrame:

df.groupBy('team').count().show()

+----+-----+
|team|count|
+----+-----+
|   A|    3|
|   B|    2|
|   C|    1|
+----+-----+

From the output we can see the three unique values (A, B, C) along with the number of times each unique value occurs.

x