What is the Rand Index?

The Rand index measures the similarity between two clusterings (partitions) of the same dataset, such as the results of two different clustering methods. It is calculated as the number of pairs of data points on which the two clusterings agree (pairs placed in the same cluster by both, plus pairs placed in different clusters by both) divided by the total number of pairs of data points. It ranges from 0 (no agreement on any pair) to 1 (perfect agreement), and it is often used in cluster analysis and machine learning to compare partitions.

Often denoted R, the Rand Index is calculated as:

R = (a+b) / (nC2)

where:

  • a: The number of pairs of elements that are placed in the same cluster by both clustering methods.
  • b: The number of pairs of elements that are placed in different clusters by both clustering methods.
  • nC2: The number of unordered pairs in a set of n elements, equal to n(n-1)/2.

The Rand index always takes on a value between 0 and 1 where:

  • 0: Indicates that two clustering methods do not agree on the clustering of any pair of elements.
  • 1: Indicates that two clustering methods perfectly agree on the clustering of every pair of elements.
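Both extremes can be checked with a small brute-force sketch. The helper name pair_rand_index is hypothetical (it does not come from any library); it simply compares every unordered pair. A value of 0 only shows up in degenerate situations, for example when one method puts every element in a single cluster and the other puts every element in its own cluster.

```python
from itertools import combinations

def pair_rand_index(labels_a, labels_b):
    """Brute-force Rand index: fraction of unordered pairs on which both labelings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Same grouping with different label names: every pair agrees.
same = pair_rand_index([1, 1, 2, 2], [7, 7, 9, 9])

# One cluster vs. all singletons: no pair agrees.
opposite = pair_rand_index([1, 1, 1, 1], [1, 2, 3, 4])

print(same, opposite)  # 1.0 0.0
```

Note that the label values themselves do not matter, only which elements are grouped together.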

The following example illustrates how to calculate the Rand index between two clustering methods for a simple dataset.

Example: How to Calculate the Rand Index

Suppose we have the following dataset of five elements:

  • Dataset: {A, B, C, D, E}

And suppose we use two clustering methods that assign the elements A through E, in that order, to the following clusters:

  • Method 1 Clusters: {1, 1, 1, 2, 2}
  • Method 2 Clusters: {1, 1, 2, 2, 3}

To calculate the Rand index between these clustering methods, we need to first write out every possible unordered pair in the dataset of five elements:

  • Unordered pairs: {A, B}, {A, C}, {A, D}, {A, E}, {B, C}, {B, D}, {B, E}, {C, D}, {C, E}, {D, E}

There are 10 unordered pairs.
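As a quick check, the list of pairs can be generated rather than written out by hand. This is a minimal sketch using only the Python standard library; the variable names are illustrative.

```python
from itertools import combinations
from math import comb

elements = ["A", "B", "C", "D", "E"]

# Every unordered pair of distinct elements, in the same order as listed above.
pairs = list(combinations(elements, 2))
print(pairs)

# The count matches nC2 for n = 5.
print(len(pairs), comb(len(elements), 2))  # 10 10
```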

Next, we need to calculate a, which represents the number of unordered pairs that belong to the same cluster across both clustering methods:

  • {A, B}

In this case, a = 1.

Next, we need to calculate b, which represents the number of unordered pairs that belong to different clusters across both clustering methods:

  • {A, D}, {A, E}, {B, D}, {B, E}, {C, E}

In this case, b = 5.

Lastly, we can calculate the Rand index as:

  • R = (a+b) / (nC2)
  • R = (1+5) / 10
  • R = 6/10

The Rand index is 0.6.
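The hand calculation above can be reproduced with a short sketch that sorts each pair into "same cluster in both methods" or "different clusters in both methods". The variable names (same_in_both, different_in_both) are illustrative choices, not part of any library.

```python
from itertools import combinations

elements = ["A", "B", "C", "D", "E"]
method1 = dict(zip(elements, [1, 1, 1, 2, 2]))
method2 = dict(zip(elements, [1, 1, 2, 2, 3]))

same_in_both = []       # pairs counted in a
different_in_both = []  # pairs counted in b
for x, y in combinations(elements, 2):
    same1 = method1[x] == method1[y]
    same2 = method2[x] == method2[y]
    if same1 and same2:
        same_in_both.append((x, y))
    elif not same1 and not same2:
        different_in_both.append((x, y))
    # Pairs on which the two methods disagree are counted in neither list.

a = len(same_in_both)
b = len(different_in_both)
n_pairs = len(elements) * (len(elements) - 1) // 2
R = (a + b) / n_pairs

print(same_in_both)       # [('A', 'B')]
print(different_in_both)  # [('A', 'D'), ('A', 'E'), ('B', 'D'), ('B', 'E'), ('C', 'E')]
print(a, b, R)            # 1 5 0.6
```

The four remaining pairs ({A, C}, {B, C}, {C, D}, {D, E}) are the ones on which the two methods disagree.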

How to Calculate the Rand Index in R

We can use the rand.index() function from the fossil package to calculate the Rand index between two clustering methods in R:

library(fossil)

#define clusters
method1 <- c(1, 1, 1, 2, 2)
method2 <- c(1, 1, 2, 2, 3)

#calculate Rand index between clustering methods
rand.index(method1, method2)

[1] 0.6

The Rand index is 0.6. This matches the value that we calculated by hand.

How to Calculate the Rand Index in Python

We can define the following function in Python to calculate the Rand index between two clusterings. Because it relies on np.bincount(), the cluster labels must be non-negative integers:

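import numpy as np
from scipy.special import comb

#define Rand index function
def rand_index(actual, pred):

    #pairs placed in the same cluster by the first labeling
    tp_plus_fn = comb(np.bincount(actual), 2).sum()
    #pairs placed in the same cluster by the second labeling
    tp_plus_fp = comb(np.bincount(pred), 2).sum()
    #stack the labelings as two columns so that each row is one element
    A = np.c_[(actual, pred)]
    #pairs placed in the same cluster by both labelings
    tp = sum(comb(np.bincount(A[A[:, 0] == i, 1]), 2).sum()
             for i in set(actual))
    fp = tp_plus_fp - tp
    fn = tp_plus_fn - tp
    #pairs placed in different clusters by both labelings
    tn = comb(len(A), 2) - tp - fp - fn
    return (tp + tn) / (tp + fp + fn + tn)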

#calculate Rand index
rand_index([1, 1, 1, 2, 2], [1, 1, 2, 2, 3])

0.6

The Rand index turns out to be 0.6. This matches the value calculated in the previous examples.
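For comparison, here is a dependency-free sketch of the same counting idea using collections.Counter and math.comb (Python 3.8 or later). The function name rand_index_counts is hypothetical. Since it counts labels with Counter instead of np.bincount(), it also accepts labels that are not integers, such as strings.

```python
from collections import Counter
from math import comb

def rand_index_counts(labels_a, labels_b):
    """Rand index from cluster sizes, without enumerating pairs."""
    n_pairs = comb(len(labels_a), 2)
    # Pairs grouped together by each labeling on its own.
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Pairs grouped together by both labelings (a in the formula).
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs separated by both labelings (b in the formula), by inclusion-exclusion.
    diff_both = n_pairs - same_a - same_b + same_both
    return (same_both + diff_both) / n_pairs

print(rand_index_counts([1, 1, 1, 2, 2], [1, 1, 2, 2, 3]))   # 0.6
print(rand_index_counts(["x", "x", "y"], ["p", "p", "q"]))   # 1.0
```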
