Metrics for Trustworthy AI

Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

This page includes technical metrics and methodologies for measuring and evaluating AI trustworthiness and AI risks. These metrics are often represented through mathematical formulas that assess the technical requirements for achieving trustworthy AI in a particular context. They can help to ensure that a system is fair, accurate, explainable, transparent, robust, safe, or secure.

Objective: Performance

Accuracy 169 related use cases

Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:

Accuracy = (TP + TN) / (TP + TN + FP + FN), where:

TP: True positive

TN: True negative

FP: False positive

FN...

Objectives: Performance, Robustness
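As a minimal sketch of the accuracy formula above, the following function computes it from the four confusion-matrix counts. The function name and the example counts are hypothetical and chosen only for illustration; FN is taken to mean false negatives.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        # With no cases processed the ratio is undefined.
        raise ValueError("accuracy is undefined when no cases were processed")
    return (tp + tn) / total


# Hypothetical counts: 85 correct predictions out of 100 cases.
print(accuracy(tp=40, tn=45, fp=5, fn=10))
```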

Mean Intersection over Union (IoU) 35 related use cases

Mean Intersection over Union (IoU) is the area of overlap between the predicted segmentation and the ground truth divided by the area of union between the predicted segmentation and the ground truth.

For binary (two classes) or multi-class segmentatio...

Objectives: Performance, Robustness
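The overlap-over-union ratio can be sketched on flattened label maps as follows. This is an illustrative version under stated assumptions: labels are integers in the range 0 to num_classes - 1, and classes that appear in neither the prediction nor the ground truth are skipped rather than counted as 0 or 1. The function name is hypothetical.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Average the per-class IoU over classes present in either label map."""
    if len(y_true) != len(y_pred):
        raise ValueError("label maps must have the same number of pixels")
    ious = []
    for c in range(num_classes):
        # Overlap: pixels labeled c in both maps.
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        # Union: pixels labeled c in at least one map.
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:  # assumption: skip classes absent from both maps
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in either label map")
    return sum(ious) / len(ious)


# Four pixels, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```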

Mahalanobis Distance 32 related use cases

Mahalanobis distance is the distance between a point and a distribution (as opposed to the distance between two points), making it the multivariate equivalent of the Euclidean distance.

It is often used in multivariate anomaly detection, classificatio...

Objectives: Performance, Robustness
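For a point x and a distribution with mean mu and covariance matrix S, the Mahalanobis distance is sqrt((x - mu)^T S^-1 (x - mu)). The sketch below restricts itself to two dimensions so that the matrix inverse can be written out by hand without a linear algebra library; the function name and that restriction are assumptions made for illustration.

```python
import math


def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance of a 2-D point from a distribution (mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]],
    # so the quadratic form expands to the expression below.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# With an identity covariance the result equals the Euclidean distance.
print(mahalanobis_2d((3.0, 4.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]))
# A larger variance along x shrinks the distance in that direction.
print(mahalanobis_2d((2.0, 0.0), (0.0, 0.0), [[4.0, 0.0], [0.0, 1.0]]))
```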

Receiver Operating Characteristic Curve (ROC) and Area Under the Curve (AUC) 16 related use cases

This metric computes the area under the curve (AUC) for the Receiver Operating Characteristic Curve (ROC). The returned value indicates how well the model predicts the correct classes for the input data. A score of 0.5 means that the model is...

Objectives: Performance, Robustness, Explainability
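One way to compute the AUC for a binary classifier is through its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that pairwise formulation rather than integrating the curve; the function name is hypothetical and labels are assumed to be 0 or 1.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration but not for large evaluation sets.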

Time until Adversary’s Success 15 related use cases

The most general time-based metric measures the time until the adversary’s success. It assumes that the adversary will succeed eventually, and is therefore an example of a pessimistic metric. This metric relies on a definition of success, and varies depend...

Objectives: Performance, Robustness, Digital Security

Bilingual Evaluation Understudy (BLEU) 15 related use cases

Bilingual Evaluation Understudy (BLEU) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: ...

Objectives: Performance, Robustness, Explainability

Precision 11 related use cases

Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation: Precision = TP / (TP + FP), where TP is the number of true positives (i.e. the examples correctly labeled as pos...

Objectives: Performance, Robustness
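A minimal sketch of Precision = TP / (TP + FP), computed directly from label lists. The function name, the `positive` parameter, and the choice to return 0.0 when nothing was predicted positive are assumptions for illustration; other conventions for the undefined case exist.

```python
def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are actually positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # assumption: no positive predictions yields 0.0
    return tp / (tp + fp)


# Three examples were labeled positive, two of them correctly.
print(precision([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
```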

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) 10 related use cases

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produce...

Objectives: Performance, Transparency, Explainability

Recall 9 related use cases

Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation: Recall = TP / (TP + FN), where TP is the number of true positives and FN is the number of false negatives.

Objectives: Fairness, Performance, Robustness
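The recall equation can be sketched in the same style, this time counting false negatives. The function name, the `positive` parameter, and the 0.0 fallback when the data contains no positive examples are illustrative assumptions.

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model labeled as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # assumption: no positive examples yields 0.0
    return tp / (tp + fn)


# Four examples are truly positive; the model finds two of them.
print(recall([1, 1, 1, 1, 0], [1, 0, 0, 1, 1]))
```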

Consensus-based Image Description Evaluation (CIDEr) 3 related use cases

The CIDEr (Consensus-based Image Description Evaluation) metric is a way of evaluating the quality of generated textual descriptions of images. The CIDEr metric measures the similarity between a generated caption and the reference captions, and it is based ...

Objectives: Performance, Explainability

Word Error Rate (WER) 3 related use cases

Word Error Rate (WER) is a common metric of the performance of an automatic speech recognition (ASR) system.

The general difficulty of measuring the performance of ASR systems lies in the fact that the recognized word sequence can have a different len...

Objectives: Performance, Robustness
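WER is commonly computed as the word-level edit distance (substitutions, deletions and insertions) between the recognized and reference word sequences, divided by the number of reference words. The sketch below assumes whitespace tokenization and no text normalization; the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted, the value can exceed 1 when the hypothesis is much longer than the reference.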

SacreBLEU 2 related use cases

SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official Workshop on Machine Translation (WMT) scores but works with plain text. It also kn...

Objectives: Performance, Robustness, Explainability

F-score 2 related use cases

In statistical analysis of binary classification, the F-score or F-measure is a measure of a test's accuracy. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all...

Objectives: Fairness, Performance, Robustness
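As an illustration of how the F-score combines precision and recall, the sketch below implements the general F-beta form, (1 + beta^2) * P * R / (beta^2 * P + R), which reduces to the harmonic mean (F1) for beta = 1. The function name and the 0.0 fallback when both inputs are zero are assumptions.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more, beta < 1 precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # assumption: both precision and recall are zero
    return (1 + b2) * precision * recall / denom


# Harmonic mean of 0.5 and 1.0.
print(f_score(0.5, 1.0))
# Weighting recall twice as heavily pulls the score toward the recall value.
print(f_score(0.5, 1.0, beta=2.0))
```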

Perplexity 2 related use cases

Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence. This can be used in two main ways:

- to evaluate how well the model has learned the distribution of the text it was traine...

Objectives: Performance, Robustness, Explainability
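Perplexity is usually defined as the exponential of the average negative log-likelihood the model assigns to each token of the sequence. The sketch below assumes the per-token probabilities have already been obtained from a model; the function name and the example probabilities are hypothetical.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability of the tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among four options, so its perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower values mean the model finds the sequence more likely.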

Exact Match 2 related use cases

A predicted string’s exact match score is 1 if it is identical to its reference string, and 0 otherwise.

Example 1: The exact match score of prediction “Happy Birthday!” is 0, given its reference is “Happy New Year!...

Objectives: Fairness, Performance, Robustness
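The definition translates almost directly into code. The sketch below compares strings without any normalization (case, punctuation and whitespace all matter), which is an assumption; the helper that averages over a dataset is likewise hypothetical.

```python
def exact_match(prediction: str, reference: str) -> int:
    """1 if the strings are identical, 0 otherwise."""
    return 1 if prediction == reference else 0


def mean_exact_match(predictions, references) -> float:
    """Average exact match score over paired predictions and references."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("need equally sized, non-empty lists")
    scores = [exact_match(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)


# The pair from Example 1 does not match, so the score is 0.
print(exact_match("Happy Birthday!", "Happy New Year!"))
```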

Adjusted Rand Index (ARI) 2 related use cases

The Adjusted Rand Index (ARI) is a widely used metric for evaluating the similarity between two clustering assignments. It improves upon the Rand Index (RI) by correcting for chance agreement, making it a more reliable meas...

Objectives: Performance, Robustness
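The chance correction can be sketched with the pair-counting formulation: count the pairs of items that share a cluster in both assignments, subtract the count expected by chance, and normalize by the maximum attainable excess. The function name is hypothetical, and returning 1.0 when the denominator is zero (for example, when both assignments put everything in one cluster) is an assumed convention.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b) -> float:
    """Pair-counting ARI between two clustering assignments."""
    if len(labels_a) != len(labels_b):
        raise ValueError("assignments must cover the same items")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Pairs that fall in the same cluster in both assignments.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that fall in the same cluster in each assignment separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    maximum = (sum_a + sum_b) / 2          # largest possible agreement
    if maximum == expected:
        return 1.0  # assumption: degenerate case treated as perfect agreement
    return (sum_cells - expected) / (maximum - expected)


# The same partition under different label names still scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Unlike the plain Rand Index, the adjusted value can be negative when the agreement is worse than chance.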

Mean Per Joint Position Error (MPJPE) 2 related use cases

Mean Per Joint Position Error (MPJPE) is a common metric used to evaluate the performance of human pose estimation algorithms. It measures the average distance between the predicted joints of a human skeleton and the ground truth joints in a given dataset. ...

Objectives: Performance, Robustness, Safety
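A minimal sketch of the averaging step for a single pose: take the Euclidean distance between each predicted joint and its ground truth counterpart, then average over joints. The function name and the sample coordinates are hypothetical, and no alignment of the skeletons is applied before measuring, which is an assumption of this sketch.

```python
import math


def mpjpe(predicted_joints, true_joints) -> float:
    """Mean Euclidean distance between matching joints of one pose."""
    if len(predicted_joints) != len(true_joints) or not predicted_joints:
        raise ValueError("need the same non-zero number of joints")
    distances = [math.dist(p, t) for p, t in zip(predicted_joints, true_joints)]
    return sum(distances) / len(distances)


# Two joints: one is off by 5 units, the other is exact, so the mean is 2.5.
pred = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
truth = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
print(mpjpe(pred, truth))
```

The result is in the same unit as the joint coordinates.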

Cross-lingual Natural Language Inference (XNLI) 1 related use case

The XNLI metric evaluates a model’s score on the XNLI dataset, a subset of a few thousand examples from the MNLI dataset that have been translated into 14 different languages, some of which are relatively low-resource, such as Swahili and...

Objectives: Performance

Conditional Entropy 1 related use case

We discuss information-theoretic anonymity metrics that use entropy over the distribution of all possible recipients to quantify anonymity. We identify a common misconception: the entropy of the distribution describing the potential receivers does not alw...

Objectives: Performance, Privacy

Sparsity 1 related use case

While smoothness and spatial locality capture spatial properties, the individual values shall also be sparse, since few highly important regions are more indicative of a good explanation than several mildly relevant ones. This is why a sparsity metric shoul...

Objectives: Performance, Transparency, Explainability

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.