These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.
Receiver Operating Characteristic Curve (ROC) and Area Under the Curve (AUC) 16 related use cases
This metric computes the area under the curve (AUC) for the Receiver Operating Characteristic Curve (ROC). The return values represent how well the model predicts the correct classes, based on the input data. A score of 0.5 means that the model is...
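As a rough illustration of the idea, the sketch below computes AUC for a binary classifier through its pairwise interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the sample data are assumptions made for this example and are not part of the catalogue entry.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth classes (assumed binary).
    scores: iterable of model scores, higher meaning "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Illustrative data: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The quadratic loop is chosen for readability; for large datasets a sort-based implementation would be the usual choice.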
Bilingual Evaluation Understudy (BLEU) 15 related use cases
Bilingual Evaluation Understudy (BLEU) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: ...
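To make the notion of correspondence concrete, here is a minimal sketch of a sentence-level BLEU score against a single reference, using clipped n-gram precision and a brevity penalty. It is a simplification under stated assumptions: whitespace tokenisation, one reference, no smoothing, and a hypothetical function name `sentence_bleu`. Reported BLEU scores are normally computed at corpus level with standardised tokenisation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed single-reference BLEU for one sentence (illustrative only)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0

    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precision += math.log(clipped / total) / max_n

    # Brevity penalty: penalise hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision)


print(sentence_bleu("the cat sat on the mat", "the cat sat on the mat"))
```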
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) 10 related use cases
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produce...
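One member of the ROUGE family, ROUGE-N recall, can be sketched as the share of reference n-grams that also appear in the candidate text. The code below is an illustrative approximation that assumes whitespace tokenisation and a single reference; the function name `rouge_n_recall` is hypothetical, and the official package includes further variants and preprocessing that are not shown.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams recovered by the candidate."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram can be matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


print(rouge_n_recall("the cat sat", "the cat sat on the mat", n=1))
```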
Gender-based Illicit Proximity Estimate (GIPE) 4 related use cases
This paper proposes a new bias evaluation metric – Gender-based Illicit Proximity Estimate (GIPE), which measures the extent of undue proximity in word vectors resulting from the presence of gender-based predilections. Experiments based on a suite of...
Consensus-based Image Description Evaluation (CIDEr) 3 related use cases
The CIDEr (Consensus-based Image Description Evaluation) metric is a way of evaluating the quality of generated textual descriptions of images. The CIDEr metric measures the similarity between a generated caption and the reference captions, and it is based ...
SacreBLEU 2 related use cases
SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official Workshop on Machine Translation (WMT) scores but works with plain text. It also kn...
Perplexity 2 related use cases
Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence. This can be used in two main ways:
- to evaluate how well the model has learned the distribution of the text it was traine...
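As a small worked example, perplexity can be computed as the exponential of the average negative log-probability the model assigns to each token. The sketch below assumes the per-token probabilities have already been obtained from some model; the function name `perplexity` and the input format are assumptions made for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token.

    token_probs: list of probabilities in (0, 1], one per token of the sequence.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_likelihood)


# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among four options, so its perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower values indicate that the model finds the sequence more likely.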
Sparsity 1 related use case
While smoothness and spatial locality capture spatial properties, the individual values shall also be sparse, since few highly important regions are more indicative of a good explanation than several mildly relevant ones. This is why a sparsity metric shoul...
Translation Edit Rate (TER) 1 related use case
Translation Edit Rate (TER), also called Translation Error Rate, is a metric to quantify the edit operations that a hypothesis requires to match a reference translation.
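The sketch below illustrates the ratio behind the metric: the number of word-level edits divided by the reference length. It is a deliberate simplification that counts only insertions, deletions and substitutions; the full TER definition also allows shifts of word sequences, which are not implemented here. The function names are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance between two token lists (no shift operation)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_ter(hypothesis, reference):
    """Edits needed to turn the hypothesis into the reference, per reference word."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)


print(simple_ter("the cat sat", "the cat sat on the mat"))
```

Because shifts are ignored, this value can be higher than the TER a full implementation would report for translations that differ mainly in word order.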
Variable Importance Cloud (VIC)
Beta Shapley
Surrogacy Efficacy Score (SESc)
The Surrogacy Efficacy Score is a technique for gaining a better understanding of the inner workings of complex "black box" models. For example, by using a tree-based model, this method provides a more interpretable representation of the model’s behavior by...
Partial Dependence Complexity (PDC)
The Partial Dependence Complexity metric uses the concept of the partial dependence curve to evaluate how simply this curve can be represented. The partial dependence curve shows how model predictions are affected on average by each feature. Curves repres...
α-Feature Importance (αFI)
The α-Feature Importance metric quantifies the minimum proportion of features required to represent α of the total importance. In other words, this metric focuses on finding the minimum number of features necessary to obtain no less than α × 100% of th...
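A minimal sketch of this computation, assuming non-negative importance scores are already available: sort the features by importance, accumulate from the largest, and stop once the running sum reaches α of the total. The function name `alpha_feature_importance` and the small tolerance for floating-point rounding are assumptions of this example.

```python
def alpha_feature_importance(importances, alpha=0.8):
    """Smallest share of features whose importances sum to at least alpha of the total.

    importances: non-negative importance scores, one per feature.
    alpha: target fraction of total importance, in (0, 1].
    """
    if not importances:
        raise ValueError("need at least one feature")
    total = sum(importances)
    if total <= 0:
        raise ValueError("total importance must be positive")

    threshold = alpha * total
    cumulative = 0.0
    # Take features from most to least important until the threshold is met.
    for count, value in enumerate(sorted(importances, reverse=True), 1):
        cumulative += value
        if cumulative >= threshold - 1e-12:  # tolerance for rounding error
            return count / len(importances)
    return 1.0


# Two of the four features already carry 80% of the total importance.
print(alpha_feature_importance([5, 3, 1, 1], alpha=0.8))
```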
Global Feature Importance Spread (GFIS)
The GFIS metric is based on the concept of entropy, more precisely on the entropy of the normalized features measure, which represents the concentration of information within a set of features. Lower entropy values indicate that the majority of the explanat...
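The entropy ingredient can be illustrated as follows. This sketch normalizes importance scores into a distribution and computes its Shannon entropy, scaled by the maximum possible entropy so the result lies between 0 and 1. It is not presented as the exact GFIS formula, which is not given in full here; the function name `normalized_importance_entropy` and the scaling by log(n) are assumptions.

```python
import math


def normalized_importance_entropy(importances):
    """Shannon entropy of normalized importances, scaled to the range [0, 1].

    Values near 0 mean importance is concentrated in few features;
    values near 1 mean it is spread evenly across all features.
    """
    n = len(importances)
    total = sum(importances)
    if n < 2 or total <= 0:
        raise ValueError("need at least two features with positive total importance")

    probabilities = [v / total for v in importances]
    # Features with zero importance contribute nothing to the entropy.
    entropy = sum(p * math.log(1 / p) for p in probabilities if p > 0)
    return entropy / math.log(n)


print(normalized_importance_entropy([0.7, 0.1, 0.1, 0.1]))
```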
SAFE (Sustainable, Accurate, Fair and Explainable)
Machine learning models, at the core of AI applications, typically achieve high accuracy at the expense of explainability. Moreover, according to the proposed regulations, AI applications based on machine learning must be "trus...
Normalized Scanpath Saliency (NSS)
The Normalized Scanpath Saliency was introduced to the saliency community as a simple correspondence measure between saliency maps and ground truth, computed as the average normalized saliency at fixated locations. Unlike in AUC, the absolute saliency value...
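The computation described above can be sketched in a few lines: standardize the saliency map to zero mean and unit standard deviation, then average the standardized values at the fixated pixels. The input format (a nested list for the map and `(row, column)` tuples for fixations), the use of the population standard deviation, and the function name `nss` are assumptions of this example.

```python
import math


def nss(saliency_map, fixations):
    """Average z-scored saliency at fixated locations.

    saliency_map: 2D list of saliency values.
    fixations: list of (row, column) positions that were fixated.
    """
    if not fixations:
        raise ValueError("need at least one fixation")
    values = [v for row in saliency_map for v in row]
    mean = sum(values) / len(values)
    # Population standard deviation over all pixels (an assumption of this sketch).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("saliency map is constant, so it cannot be normalized")

    return sum((saliency_map[r][c] - mean) / std for r, c in fixations) / len(fixations)


# The single fixation falls on the most salient pixel, so the score is positive.
print(nss([[0, 0], [0, 4]], [(1, 1)]))
```

A score near zero corresponds to chance-level agreement between the map and the fixations.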
Kendall rank correlation coefficient (KRCC)
In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient, is a statistic used to measure the ordinal association between two measured quantities. A τ test is a non-parametric hypothesis test for statistical de...
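For illustration, the sketch below computes the simplest variant, often called τ-a: the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. It applies no correction for ties (unlike the τ-b variant), and the function name `kendall_tau_a` is an assumption of this example.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, no tie correction."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")

    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both quantities order the pair the same way
        elif direction < 0:
            discordant += 1   # the two quantities disagree on the order
        # tied pairs count as neither

    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# One swapped pair out of six: five concordant, one discordant.
print(kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4]))
```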
Learned Perceptual Image Patch Similarity (LPIPS)
The learned perceptual image patch similarity (LPIPS) is used to judge the perceptual similarity between two images. LPIPS is computed with a model that is trained on a labeled dataset of human-judged perceptual similarity. The perception-measuring model co...
Pearson correlation coefficient (PCC)
In statistics, the Pearson correlation coefficient (PCC), also known as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficient, is a measure of linear corre...
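As a short worked example, Pearson's r can be computed as the covariance of two samples divided by the product of their standard deviations. The sketch below uses only the standard library; the function name `pearson_r` and the sample data are assumptions made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")

    return covariance / math.sqrt(var_x * var_y)


# y is an exact linear function of x, so the coefficient is at its maximum.
print(pearson_r([1, 2, 3], [2, 4, 6]))
```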
