The Adjusted Rand Index (ARI) is a widely used metric for evaluating the similarity between two clustering assignments. It improves upon the Rand Index (RI) by correcting for chance agreement, making it a more reliable measure of clustering similarity. Unlike the unadjusted Rand Index, which ranges from 0 to 1, the ARI has a maximum of 1, stays close to 0 when agreement is at chance level, and can become negative; higher values indicate stronger similarity.
Applicable Models
ARI is applicable to various clustering algorithms, including:
• Partition-based clustering (e.g., k-means, Gaussian Mixture Models)
• Hierarchical clustering (e.g., agglomerative clustering)
• Spectral clustering
• Density-based clustering (e.g., DBSCAN, OPTICS)
• Community detection algorithms (e.g., Louvain method for graph clustering)
Background
The Rand Index (RI) is a fundamental measure of clustering agreement, based on counting the number of data point pairs that are consistently clustered in two different partitions. However, RI does not account for random chance, potentially inflating similarity scores. The ARI corrects for this by incorporating the expected similarity under random clustering, ensuring a more meaningful evaluation.
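The inflation effect can be illustrated with a small experiment: two labelings drawn independently at random share no real structure, yet their Rand Index is high, while the chance-corrected score stays near zero. The sketch below is only an illustration under stated assumptions: the function name rand_and_adjusted_rand, the sample size, the number of clusters and the seed are all hypothetical choices, and the function assumes at least two samples and non-trivial partitions (no guard against a zero denominator).

```python
import random
from collections import Counter
from math import comb


def rand_and_adjusted_rand(labels1, labels2):
    """Return (RI, ARI) for two equal-length label sequences.

    Assumes at least two samples and partitions that are neither
    a single cluster nor all singletons in both labelings.
    """
    total = comb(len(labels1), 2)  # number of sample pairs
    # Pairs placed together in both labelings, in labeling 1, and in labeling 2.
    together_both = sum(comb(v, 2) for v in Counter(zip(labels1, labels2)).values())
    together_1 = sum(comb(v, 2) for v in Counter(labels1).values())
    together_2 = sum(comb(v, 2) for v in Counter(labels2).values())
    # Pairs separated in both = total - together_1 - together_2 + together_both.
    ri = (total + 2 * together_both - together_1 - together_2) / total
    expected = together_1 * together_2 / total
    ari = (together_both - expected) / (0.5 * (together_1 + together_2) - expected)
    return ri, ari


rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
n_samples, n_clusters = 300, 10
labels_a = [rng.randrange(n_clusters) for _ in range(n_samples)]
labels_b = [rng.randrange(n_clusters) for _ in range(n_samples)]

ri, ari = rand_and_adjusted_rand(labels_a, labels_b)
print(f"RI = {ri:.3f}, ARI = {ari:.3f}")
```

With ten roughly equal clusters, most pairs are separated in both labelings, so the Rand Index is dominated by agreement on "different cluster" pairs even though the labelings are unrelated.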
Formulae
Let N be the total number of samples, and let C1 and C2 represent two clustering assignments. Define:
• a: The number of sample pairs in the same cluster in both C1 and C2.
• b: The number of sample pairs in different clusters in both C1 and C2.
• c: The number of sample pairs in the same cluster in C1 but in different clusters in C2.
• d: The number of sample pairs in different clusters in C1 but in the same cluster in C2.
The Rand Index (RI) is given by:
RI = (a + b) / (a + b + c + d)
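The pair counts and the Rand Index can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names pair_counts and rand_index are hypothetical, the clusterings are assumed to be given as equal-length label sequences, and the loop over all pairs is quadratic in N.

```python
from itertools import combinations


def pair_counts(labels1, labels2):
    """Return the pair counts (a, b, c, d) for two label sequences."""
    if len(labels1) != len(labels2):
        raise ValueError("both clusterings must label the same samples")
    a = b = c = d = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1  # together in C1 and in C2
        elif not same1 and not same2:
            b += 1  # apart in C1 and in C2
        elif same1:
            c += 1  # together in C1, apart in C2
        else:
            d += 1  # apart in C1, together in C2
    return a, b, c, d


def rand_index(labels1, labels2):
    """Rand Index: fraction of pairs on which the two clusterings agree."""
    a, b, c, d = pair_counts(labels1, labels2)
    total = a + b + c + d  # equals N choose 2
    # Returning 1.0 for fewer than two samples is an assumed convention.
    return (a + b) / total if total else 1.0


c1 = [0, 0, 1, 1, 2, 2]
c2 = [0, 0, 1, 2, 2, 2]
print(pair_counts(c1, c2))  # (2, 10, 1, 2)
print(rand_index(c1, c2))   # 12 agreeing pairs out of 15
```

In this example, 10 of the 12 agreeing pairs are pairs that both clusterings keep apart, which is the kind of agreement that also arises by chance.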
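The expected value (E) of the Rand Index, under random clusterings that keep the cluster sizes of C1 and C2 fixed, is computed as:
E = 1 + 2 * P * Q / T^2 - (P + Q) / T
where P = Σ (n_i choose 2), Q = Σ (m_j choose 2) and T = (N choose 2), with n_i and m_j the sizes of the clusters in C1 and C2, respectively. The quantity P * Q / T is the expected value of a, the number of pairs grouped together in both clusterings.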
The Adjusted Rand Index (ARI) is then computed as:
ARI = (RI - E) / (1 - E)
The ARI has an upper bound of 1 and can take negative values, where:
• 1 indicates perfect agreement between two clustering results.
• Values near 0 correspond to the level of agreement expected from random clustering.
• Negative values indicate less agreement than expected by chance.
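The chance-corrected form ARI = (RI - E) / (1 - E) can be sketched directly from the pair counts a, b, c and d. The sketch assumes that E is the expected Rand Index under random labelings with fixed cluster sizes, E = 1 + 2PQ/T^2 - (P + Q)/T with P = a + c, Q = a + d and T = a + b + c + d, which is the form consistent with the contingency-table formula. The function name adjusted_rand_index and the return value of 1.0 for degenerate inputs are assumptions made for illustration.

```python
from itertools import combinations


def adjusted_rand_index(labels1, labels2):
    """ARI = (RI - E) / (1 - E), computed from the pair counts a, b, c, d."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1
        elif not same1 and not same2:
            b += 1
        elif same1:
            c += 1
        else:
            d += 1

    total = a + b + c + d  # T = N choose 2
    if total == 0:
        return 1.0  # fewer than two samples: assumed convention

    same_c1 = a + c  # P: pairs together in C1
    same_c2 = a + d  # Q: pairs together in C2
    ri = (a + b) / total
    expected = 1 + 2 * same_c1 * same_c2 / total**2 - (same_c1 + same_c2) / total
    if expected == 1:
        return 1.0  # both partitions trivial: assumed convention
    return (ri - expected) / (1 - expected)


# Same grouping with renamed cluster labels: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Every pair grouped by one clustering is split by the other: about -0.5.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The second example shows how a negative score arises: the two clusterings never place the same pair together, so they agree less often than random labelings with the same cluster sizes would.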
Contingency Table Representation
A contingency table provides an alternative way to compute ARI. Given a dataset S of N samples, and two partitions C1 and C2, we define a contingency table where each entry n_ij represents the number of samples common to cluster i in C1 and cluster j in C2.
Using this table, the ARI formula can be rewritten as:
ARI = (Σ (n_ij choose 2) - [(Σ (a_i choose 2) * Σ (b_j choose 2)) / (N choose 2)]) / (0.5 * (Σ (a_i choose 2) + Σ (b_j choose 2)) - [(Σ (a_i choose 2) * Σ (b_j choose 2)) / (N choose 2)])
where (a_i and b_j are marginal sums of the table, not the pair counts a and b defined earlier):
• a_i is the sum of row i in the contingency table (size of cluster i in C1).
• b_j is the sum of column j (size of cluster j in C2).
This formulation aligns with the one commonly found in literature, such as in the JMLR paper (v18/17-039).
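The contingency-table formula can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code of any particular library: the function name ari_from_contingency is hypothetical, the table is stored sparsely as counts of (cluster in C1, cluster in C2) label pairs, and degenerate inputs return 1.0 by assumed convention. It requires Python 3.8 or later for math.comb.

```python
from collections import Counter
from math import comb


def ari_from_contingency(labels1, labels2):
    """ARI computed from the contingency table of two label sequences."""
    n = len(labels1)
    # n_ij: samples shared by cluster i of C1 and cluster j of C2 (non-zero cells only).
    table = Counter(zip(labels1, labels2))
    row_sums = Counter(labels1)  # a_i: size of cluster i in C1
    col_sums = Counter(labels2)  # b_j: size of cluster j in C2

    sum_nij = sum(comb(v, 2) for v in table.values())
    sum_a = sum(comb(v, 2) for v in row_sums.values())
    sum_b = sum(comb(v, 2) for v in col_sums.values())

    # Expected value of sum_nij under random labelings with fixed cluster sizes.
    expected = sum_a * sum_b / comb(n, 2) if n >= 2 else 0.0
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate partitions: assumed convention
    return (sum_nij - expected) / (max_index - expected)


c1 = [0, 0, 1, 1, 2, 2]
c2 = [0, 0, 1, 2, 2, 2]
print(ari_from_contingency(c1, c2))  # (2 - 0.8) / (3.5 - 0.8), about 0.444
```

Because it only counts labels, this version runs in time roughly linear in N, whereas enumerating every sample pair is quadratic.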
Applications
ARI is widely used for evaluating clustering algorithms in:
• Machine Learning & Data Mining: Comparing different clustering methods.
• Bioinformatics: Gene expression clustering and cell type identification.
• Social Network Analysis: Measuring community detection performance.
• Market Segmentation: Assessing customer segmentation models.
Impact
The Adjusted Rand Index provides a robust, chance-corrected, and interpretable way to evaluate clustering similarity. By adjusting for expected agreement, it prevents misleadingly high similarity scores and enables fair comparisons across clustering algorithms and datasets.
Related use cases:
• SPICE: Semantic Pseudo-labeling for Image Clustering (uploaded on Mar 27, 2023)
• Information Maximization Clustering via Multi-View Self-Labelling (uploaded on Mar 27, 2023)
