Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

This page includes technical metrics and methodologies for measuring and evaluating AI trustworthiness and AI risks. These metrics are often represented through mathematical formulas that assess the technical requirements for achieving trustworthy AI in a particular context. They can help to ensure that a system is fair, accurate, explainable, transparent, robust, safe, or secure.

If a model systematically makes errors disproportionately for patients in the protected group, it is likely to lead to unequal outcomes. Equal performance refers to the assurance that a model is equally accurate for patients in the protec...
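
One way to check equal performance numerically is to compare accuracy across groups. A minimal sketch in Python (the function and variable names are ours, not the catalogue's; assumes binary 0/1 group labels):

import numpy as np

def accuracy_gap(y_true, y_pred, group):
    # Absolute difference in accuracy between the two groups encoded in `group` (0/1).
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    acc = [np.mean(y_pred[group == g] == y_true[group == g]) for g in (0, 1)]
    return abs(acc[0] - acc[1])

A gap close to zero indicates the model is about equally accurate for both groups.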

We propose a criterion for discrimination against a specified sensitive attribute in supervised learning, where the goal is to predict some target based on available features. Assuming data about the predictor, target, and membership in the protected group...
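
The criterion described here resembles equalized odds: predictions should have similar true-positive and false-positive rates across groups. A rough sketch of those per-group gaps (names are illustrative; assumes binary 0/1 labels and predictions):

import numpy as np

def tpr_fpr_gaps(y_true, y_pred, group):
    # Per-group true-positive and false-positive rates, returned as absolute gaps.
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in (0, 1):
        m = group == g
        rates[g] = (np.mean(y_pred[m & (y_true == 1)]),  # TPR for group g
                    np.mean(y_pred[m & (y_true == 0)]))  # FPR for group g
    return abs(rates[0][0] - rates[1][0]), abs(rates[0][1] - rates[1][1])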

We study fairness in classification, where individuals are classified, e.g., admitted to a university, and the goal is to prevent discrimination against individuals based on their membership in some group, while maintaining utility for the classifier (the ...

Hellinger distance

The Hellinger distance (sometimes called the Jeffreys distance) is a metric in the space of probability distributions. The Hellinger distance can be used to quantify the degree of similarity between two probability ...
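
For discrete distributions P = (p_1, ..., p_k) and Q = (q_1, ..., q_k), the distance is H(P, Q) = (1/√2) · sqrt(Σ_i (√p_i − √q_i)²). A minimal sketch (inputs are assumed to be non-negative weight vectors of equal length):

import numpy as np

def hellinger(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()  # normalize to probability vectors
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

print(hellinger([0.5, 0.5], [0.9, 0.1]))  # 0 for identical distributions, 1 for disjoint support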

This paper proposes a new bias evaluation metric – Gender-based Illicit Proximity Estimate (GIPE), which measures the extent of undue proximity in word vectors resulting from the presence of gender-based predilections. Experiments based on a suite of...

The demographic disparity metric (DD) determines whether a facet has a larger proportion of the rejected outcomes in the dataset than of the accepted outcomes. In the binary case where there are two facets, men and women for example, that constitute the dat...
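
On this definition, DD for a facet is the proportion of rejected outcomes belonging to the facet minus the proportion of accepted outcomes belonging to it. A minimal sketch over boolean arrays (names ours):

import numpy as np

def demographic_disparity(facet, accepted):
    # DD = P(facet | rejected) - P(facet | accepted); positive values flag disparity.
    facet, accepted = np.asarray(facet, bool), np.asarray(accepted, bool)
    rejected = ~accepted
    return facet[rejected].mean() - facet[accepted].mean()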

RADio introduces a rank-aware Jensen-Shannon (JS) divergence. This combination accounts for (i) a user’s decreasing propensity to observe items further down a list and (ii) full distributional shifts as opposed to point estimates.
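
A rough sketch of the idea, assuming a simple geometric rank discount (RADio's exact discounting and distribution construction follow the paper); items are represented here by integer category codes:

import numpy as np
from scipy.spatial.distance import jensenshannon

def rank_discounted_distribution(categories, n_cats, gamma=0.85):
    # Category distribution over a ranked list, down-weighting lower ranks (assumed discount).
    weights = gamma ** np.arange(len(categories))
    dist = np.zeros(n_cats)
    for rank, c in enumerate(categories):
        dist[c] += weights[rank]
    return dist / dist.sum()

def rank_aware_js(recommended, context, n_cats):
    p = rank_discounted_distribution(recommended, n_cats)
    q = rank_discounted_distribution(context, n_cats)
    return jensenshannon(p, q) ** 2  # SciPy returns the JS distance, i.e. sqrt of the divergence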

Contextual Outlier INterpretation (COIN) is a method designed to explain the abnormality of existing outliers spotted by detectors. The interpretability for an outlier is achieved from three aspects: outlierness score, attributes that contribute to the abnormality, a...

Given an input data sample, LEMNA generates a small set of interpretable features to explain how the input sample is classified. The core idea is to approximate a local area of the complex deep learning decision boundary using a simple interpretable model. The...

SHAP (SHapley Additive exPlanations) assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures, and (2) theoretical results showing there is...
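
A typical usage sketch with the shap Python package (the model and dataset here are illustrative choices, not prescribed by the method):

import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100).fit(X, y)
explainer = shap.TreeExplainer(model)              # exact, fast Shapley values for tree ensembles
shap_values = explainer.shap_values(X.iloc[:100])  # one attribution per feature per prediction
shap.summary_plot(shap_values, X.iloc[:100])       # global view of feature importance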

LIME is a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.
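
A usage sketch with the lime package (model and dataset are illustrative):

from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier().fit(data.data, data.target)
explainer = LimeTabularExplainer(data.data, feature_names=data.feature_names,
                                 class_names=list(data.target_names), mode="classification")
exp = explainer.explain_instance(data.data[0], model.predict_proba, num_features=3)
print(exp.as_list())  # local, interpretable feature weights for this one prediction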

Following the VIC framework, our proposed ShapleyVIC extends the widely used Shapley-based variable importance measures beyond final models for a comprehensive assessment and has important practical implications.

Ideally we would like to obtain a more complete understanding of variable importance for the set of models that predict almost equally well. This set of almost-equally-accurate predictive models is called the Rashomon set; it is the set of models with training...
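
A toy sketch of the idea, keeping every candidate model whose training loss is within a tolerance eps of the best (the candidate grid and eps are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=500, random_state=0)
candidates = [LogisticRegression(C=c, max_iter=1000).fit(X, y) for c in (0.01, 0.1, 1, 10, 100)]
losses = [log_loss(y, m.predict_proba(X)) for m in candidates]
eps = 0.01
rashomon_set = [m for m, loss in zip(candidates, losses) if loss <= min(losses) + eps]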

The Banzhaf power index measures voting power as the probability of changing the outcome of a vote in which voting rights are not necessarily equally divided among the voters. Data Banzhaf uses this notion to measure data points' "voting powers" towards algorit...
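
For a weighted voting game, the (normalized) index can be computed by brute force over coalitions; a minimal sketch (exponential in the number of voters, fine for small examples):

from itertools import combinations

def banzhaf(weights, quota):
    # For each voter, count the coalitions that the voter swings from losing to winning.
    n = len(weights)
    swings = [0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for coalition in combinations(others, r):
                s = sum(weights[j] for j in coalition)
                if s < quota <= s + weights[i]:  # voter i is pivotal here
                    swings[i] += 1
    total = sum(swings)
    return [s / total for s in swings]

print(banzhaf([4, 3, 2, 1], quota=6))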

In a cooperative game, there are n players D = {1, ..., n} and a score function v : 2^[n] → R assigns a reward to each of the 2^n subsets of players: v(S) is the reward if the players in subset S ⊆ D cooperate. We view the supervised machine learning problem as a coo...
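
The Shapley value of player i averages its marginal contribution v(S ∪ {i}) − v(S) over orderings of the players. An exact brute-force sketch (in the data-valuation setting, v(S) would be the validation score of a model trained on data subset S; the toy v below is only a placeholder):

from itertools import permutations

def shapley_values(players, v):
    # Average each player's marginal contribution over all orderings.
    phi = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = set()
        for p in order:
            before = v(frozenset(coalition))
            coalition.add(p)
            phi[p] += v(frozenset(coalition)) - before
    return {p: val / len(perms) for p, val in phi.items()}

print(shapley_values([1, 2, 3], v=lambda S: len(S) ** 0.5))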

The PGC metric compares the top-K ranking of feature importances drawn from the entire dataset with the top-K ranking induced from specific subgroups of predictions. It can be applied to both classification and regression problems, being useful for quantifying...
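
One simple way to operationalize such a comparison (PGC's exact formula follows the original paper) is the overlap between the two top-K feature sets:

def topk_overlap(global_importance, subgroup_importance, k=5):
    # Jaccard overlap between global and subgroup top-k features (importances as dicts feature -> score).
    top = lambda imp: set(sorted(imp, key=imp.get, reverse=True)[:k])
    g, s = top(global_importance), top(subgroup_importance)
    return len(g & s) / len(g | s)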

The metric GFIS is based on the concept of entropy: more precisely, on the entropy of the normalized feature importances, which represents the concentration of information within a set of features. Lower entropy values indicate that the majority of the explanat...
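
An illustrative computation of that entropy (the exact normalization used by GFIS follows the original paper):

import numpy as np

def importance_entropy(importances):
    # Shannon entropy of normalized importances; lower = explanation concentrated on few features.
    p = np.abs(np.asarray(importances, float))
    p = p / p.sum()
    p = p[p > 0]  # ignore zero-importance features (0 * log 0 = 0)
    return -np.sum(p * np.log2(p))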

Machine learning models, at the core of AI applications, typically achieve high accuracy at the expense of insufficient explainability. Moreover, according to the proposed regulations, AI applications based on machine learning must be "trus...

In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient, is a statistic used to measure the ordinal association between two measured quantities. A τ test is a non-parametric hypothesis test for statistical de...
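
In Python, the coefficient and its significance test are available in SciPy:

from scipy.stats import kendalltau

tau, p_value = kendalltau([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(tau, p_value)  # tau near +1: rankings agree; near -1: rankings are reversed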

In the field of health, equal patient outcomes refers to the assurance that protected groups have equal benefit in terms of patient outcomes from the deployment of machine-learning models. A weak form of equal outcomes is ensuring that both the protect...


Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.