These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.
Accuracy 175 related use cases
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN), where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative
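As a minimal illustration (a plain-Python sketch, not the catalogue's implementation), the formula above maps directly to code:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Proportion of correct predictions among all cases processed."""
    return (tp + tn) / (tp + tn + fp + fn)

# e.g. 50 true positives, 30 true negatives, 10 false positives, 10 false negatives
score = accuracy(50, 30, 10, 10)  # -> 0.8
```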
Mean Intersection over Union (IoU) 35 related use cases
Mean Intersection over Union (IoU) is the area of overlap between the predicted segmentation and the ground truth divided by the area of union between the predicted segmentation and the ground truth.
For binary (two classes) or multi-class segmentation, the mean IoU is calculated by taking the IoU of each class and averaging them.
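For flattened label arrays, the per-class averaging described above can be sketched in plain Python (an illustrative helper, not any library's API):

```python
def mean_iou(pred, truth, num_classes):
    """Average, over classes, of per-class intersection / union.

    `pred` and `truth` are flat sequences of integer class labels;
    classes absent from both arrays are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)
```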
Receiver Operating Characteristic Curve (ROC) and Area Under the Curve (AUC) 16 related use cases
This metric computes the area under the curve (AUC) for the Receiver Operating Characteristic Curve (ROC). The return values represent how well the model used is predicting the correct classes, based on the input data. A score of 0.5 means that the model is predicting exactly at chance level.
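AUC can equivalently be computed as the probability that a randomly chosen positive example is scored above a randomly chosen negative one (the normalised Mann-Whitney U statistic). A small sketch under that interpretation, counting ties as half a win:

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison: fraction of (positive, negative)
    pairs where the positive gets the higher score (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This O(n²) form is only practical for small samples, but makes the chance-level interpretation of 0.5 concrete.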
Bilingual Evaluation Understudy (BLEU) 15 related use cases
Bilingual Evaluation Understudy (BLEU) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: the closer a machine translation is to a professional human translation, the better it is.
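A minimal sentence-level sketch of the idea, assuming a single reference (real BLEU implementations such as SacreBLEU add multi-reference support, smoothing, and standardised tokenization):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity
    penalty; returns 0.0 if any n-gram precision is zero."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0
        precisions.append(overlap / total)
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```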
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) 10 related use cases
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference (human-produced) summary or translation.
Gender-based Illicit Proximity Estimate (GIPE) 5 related use cases
This paper proposes a new bias evaluation metric – Gender-based Illicit Proximity Estimate (GIPE), which measures the extent of undue proximity in word vectors resulting from the presence of gender-based predilections. Experiments based on a suite of ...
Word Error Rate (WER) 3 related use cases
Word Error Rate (WER) is a common metric of the performance of an automatic speech recognition (ASR) system.
The general difficulty of measuring the performance of ASR systems lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level.
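The word-level Levenshtein computation can be sketched with the standard dynamic-programming recurrence (an illustrative helper, not any toolkit's implementation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between first i reference and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)
```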
Mean Per Joint Position Error (MPJPE) 2 related use cases
Mean Per Joint Position Error (MPJPE) is a common metric used to evaluate the performance of human pose estimation algorithms. It measures the average distance between the predicted joints of a human skeleton and the ground truth joints in a given dataset.
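For joints given as 3D coordinates, the averaging is straightforward (a plain-Python sketch over per-joint Euclidean distances):

```python
import math

def mpjpe(predicted, ground_truth):
    """Mean Euclidean distance between corresponding predicted and
    ground-truth joints, each given as an (x, y, z) tuple."""
    dists = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(dists) / len(dists)
```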
Adjusted Rand Index (ARI) 2 related use cases
The Adjusted Rand Index (ARI) is a widely used metric for evaluating the similarity between two clustering assignments. It improves upon the Rand Index (RI) by correcting for chance agreement, making it a more reliable measure of clustering similarity.
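The chance correction can be sketched from the contingency table of the two assignments (an illustrative helper; it does not handle the degenerate case where both clusterings put everything in one cluster):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI = (index - expected index) / (max index - expected index)."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(v, 2) for v in contingency.values())   # agreeing pairs
    sum_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    sum_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # expected agreement under chance
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Note that ARI is invariant to relabeling: identical partitions score 1.0 even if the cluster IDs differ.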
Exact Match 2 related use cases
A given predicted string’s exact match score is 1 if it is the exact same as its reference string, and is 0 otherwise.
- Example 1: The exact match score of prediction “Happy Birthday!” is 0, given its reference is “Happy New Year!”.
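The definition reduces to a string comparison (implementations often normalise case, punctuation, or whitespace first; this bare sketch does not):

```python
def exact_match(prediction: str, reference: str) -> int:
    """1 if the prediction equals the reference exactly, else 0."""
    return int(prediction == reference)
```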
Perplexity 2 related use cases
Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence. This can be used in two main ways:
- to evaluate how well the model has learned the distribution of the text it was trained on ...
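Given the per-token probabilities a model assigns to a sequence, perplexity is the exponential of the average negative log-probability. A minimal sketch (the per-token probabilities would in practice come from a language model):

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability over the sequence.
    Lower is better; a uniform model over V tokens has perplexity V."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
```

For example, a model that assigns probability 1/4 to every token of a sequence has perplexity 4.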
SacreBLEU 2 related use cases
SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official Workshop on Machine Translation (WMT) scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
Mean Squared Error (MSE) 1 related use case
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual value.
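The definition in code (a plain-Python sketch over paired sequences):

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and estimated values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```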
Crosslingual Optimized Metric for Evaluation of Translation (COMET) 1 related use case
Crosslingual Optimized Metric for Evaluation of Translation (COMET) is a metric for automatic evaluation of machine translation that calculates the similarity between a machine translation output and a reference translation using token or sentence embeddings.
Statistical Parity Difference (SPD) 1 related use case
We study fairness in classification, where individuals are classified, e.g., admitted to a university, and the goal is to prevent discrimination against individuals based on their membership in some group, while maintaining utility for the classifier (the university).
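Statistical parity difference is commonly computed as the positive-prediction rate of the unprivileged group minus that of the privileged group, with 0 indicating parity. A minimal sketch under that convention (group encoding is an assumption for illustration):

```python
def statistical_parity_difference(y_pred, group):
    """P(y_hat = 1 | unprivileged) - P(y_hat = 1 | privileged).

    `y_pred` holds 0/1 predictions; `group` holds 0 for unprivileged
    and 1 for privileged members. Negative values mean the unprivileged
    group receives positive outcomes less often.
    """
    unpriv = [y for y, g in zip(y_pred, group) if g == 0]
    priv = [y for y, g in zip(y_pred, group) if g == 1]
    return sum(unpriv) / len(unpriv) - sum(priv) / len(priv)
```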
Character error rate (CER) 1 related use case
CER supports Safety by reducing the likelihood of harmful or misleading outputs due to transcription errors, which is especially important in domains like healthcare or legal transcription. It also supports Robustness by providing a fine-grained, character-level measure of transcription accuracy.
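CER applies the same edit-distance recurrence as WER, but over characters instead of words. A compact two-row sketch:

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance divided by reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)
```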
Mean Absolute Error (MAE)
In statistics, mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement.
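As with MSE, the definition is a one-line average, but over absolute rather than squared differences (a plain-Python sketch):

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between paired observations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```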
FrugalScore
FrugalScore is a reference-based metric for Natural Language Generation (NLG) model evaluation. It is based on a distillation approach that allows learning a fixed, low-cost version of any expensive NLG metric, while retaining most of its original performance.
MAUVE
MAUVE is a library built on PyTorch and HuggingFace Transformers to measure the gap between neural text and human text with the eponymous MAUVE measure. It summarizes both Type I and Type II errors measured softly using Kullback–Leibler (KL) divergences.
3D Pose Correct Keypoints
The '3DPCK' metric (3D Pose Correct Keypoints) is a performance metric used to evaluate the accuracy of 3D human pose estimation algorithms. It measures the percentage of keypoints for which the estimated 3D pose is within a certain distance of the ground truth position.