These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.
Accuracy 168 related use cases
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN), where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative
Objectives:
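A minimal Python sketch of this formula, using made-up confusion-matrix counts purely for illustration:

```python
# Minimal sketch: accuracy from confusion-matrix counts.
# The counts below are illustrative only.
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85
```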
Mean Intersection over Union (IoU) 35 related use cases
Mean Intersection over Union (IoU) is the area of overlap between the predicted segmentation and the ground truth divided by the area of union between the predicted segmentation and the ground truth.
For binary (two classes) or multi-class segmentation, the mean IoU of the image is calculated by taking the IoU of each class and averaging them.
Objectives:
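A minimal sketch of per-class IoU and its mean on toy label maps (NumPy assumed available; the arrays are illustrative only):

```python
import numpy as np

# Minimal sketch: mean IoU over classes for a toy 2-class segmentation.
pred  = np.array([[0, 0, 1, 1],
                  [0, 1, 1, 1]])
truth = np.array([[0, 0, 0, 1],
                  [0, 1, 1, 1]])

ious = []
for cls in np.unique(truth):
    p, t = pred == cls, truth == cls
    intersection = np.logical_and(p, t).sum()
    union = np.logical_or(p, t).sum()
    ious.append(intersection / union)

mean_iou = np.mean(ious)
print(mean_iou)  # average of per-class IoU values
```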
Mahalanobis Distance 32 related use cases
Mahalanobis distance is the distance between a point and a distribution (as opposed to the distance between two points), making it the multivariate equivalent of the Euclidean distance.
It is often used in multivariate anomaly detection, classification on highly imbalanced datasets, and one-class classification.
Objectives:
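A minimal sketch using SciPy's mahalanobis helper on randomly generated data (the reference data and query point are illustrative assumptions):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Minimal sketch: Mahalanobis distance of a point from a sample distribution.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3))    # reference distribution (500 points, 3 features)
point = np.array([2.0, 0.0, -1.0])  # query point

mean = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))  # inverse covariance matrix
print(mahalanobis(point, mean, cov_inv))
```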
Receiver Operating Characteristic Curve (ROC) and Area Under the Curve (AUC) 16 related use cases
This metric computes the area under the curve (AUC) for the Receiver Operating Characteristic Curve (ROC). The returned value represents how well the model predicts the correct classes, based on the input data. A score of 0.5 means that the model is predicting at chance level, i.e. no better than random guessing, while a score of 1.0 indicates perfect separation of the classes.
Objectives:
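A minimal sketch using scikit-learn's roc_auc_score on made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Minimal sketch: AUC for a binary classifier.
# Labels and scores are toy values; scores would normally be predicted probabilities.
y_true  = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]

print(roc_auc_score(y_true, y_score))
```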
Time until Adversary’s Success 15 related use cases
The most general time-based metric measures the time until the adversary’s success. It assumes that the adversary will succeed eventually, and is therefore an example of a pessimistic metric. This metric relies on a definition of success, and varies depend...
Objectives:
Bilingual Evaluation Understudy (BLEU) 15 related use cases
Bilingual Evaluation Understudy (BLEU) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: the closer a machine translation is to a professional human translation, the better it is.
Objectives:
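A minimal sentence-level sketch using NLTK's sentence_bleu (NLTK assumed installed; the token lists are toy examples):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Minimal sketch: sentence-level BLEU on toy token lists.
reference = ["the", "cat", "is", "on", "the", "mat"]
hypothesis = ["the", "cat", "sat", "on", "the", "mat"]

score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(score)
```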
Precision 11 related use cases
Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation: Precision = TP / (TP + FP), where TP is the number of true positives (i.e. the examples correctly labeled as positive) and FP is the number of false positives (i.e. the examples incorrectly labeled as positive).
Objectives:
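A minimal sketch computing precision directly from toy label lists:

```python
# Minimal sketch: precision from predicted and true binary labels (toy data).
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives

precision = tp / (tp + fp)
print(precision)  # 3 / (3 + 2) = 0.6
```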
Recall 9 related use cases
Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation: Recall = TP / (TP + FN), where TP is the number of true positives and FN is the number of false negatives.
Objectives:
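A minimal sketch using scikit-learn's recall_score on toy binary labels:

```python
from sklearn.metrics import recall_score

# Minimal sketch: recall on toy binary labels.
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 1]

print(recall_score(y_true, y_pred))  # TP / (TP + FN) = 3 / 4 = 0.75
```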
Word Error Rate (WER) 3 related use cases
Word Error Rate (WER) is a common metric of the performance of an automatic speech recognition (ASR) system.
The general difficulty of measuring the performance of ASR systems lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one).
Objectives:
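A minimal sketch computing WER via word-level edit distance (a plain-Python dynamic program; the sentences are toy examples):

```python
# Minimal sketch: word error rate via word-level edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown dog jumps"))  # 2 errors / 4 words = 0.5
```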
SacreBLEU 2 related use cases
SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official Workshop on Machine Translation (WMT) scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization.
Objectives:
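A minimal sketch assuming the sacrebleu package is installed; the hypothesis and reference strings are toy examples:

```python
import sacrebleu

# Minimal sketch: corpus-level BLEU with sacrebleu.
hypotheses = ["the cat sat on the mat"]
references = [["the cat is on the mat"]]  # one list per reference set

result = sacrebleu.corpus_bleu(hypotheses, references)
print(result.score)
```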
F-score 2 related use cases
In statistical analysis of binary classification, the F-score or F-measure is a measure of a test's accuracy. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all positive results (including those incorrectly identified as positive), and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive.
Objectives:
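A minimal sketch of the F1 score as the harmonic mean of precision and recall, with made-up values:

```python
# Minimal sketch: F1 score from precision and recall (toy values).
precision = 0.6
recall = 0.75

f1 = 2 * precision * recall / (precision + recall)
print(f1)  # harmonic mean of precision and recall
```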
Perplexity 2 related use cases
Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence. This can be used in two main ways:
- to evaluate how well the model has learned the distribution of the text it was trained on
- to evaluate how well the model generalizes to unseen data
Objectives:
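A minimal sketch computing perplexity from per-token probabilities; the probabilities below are made up, whereas in practice they come from the model being evaluated:

```python
import math

# Minimal sketch: perplexity as the exponential of the average negative log-likelihood.
token_probs = [0.2, 0.5, 0.1, 0.4]  # illustrative per-token probabilities

avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_neg_log_likelihood)
print(perplexity)  # lower means the model finds the sequence more likely
```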
Exact Match 2 related use cases
A given predicted string’s exact match score is 1 if it is the exact same as its reference string, and is 0 otherwise.
- Example 1: The exact match score of prediction “Happy Birthday!” is 0, given its reference is “Happy New Year!”.
Objectives:
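A minimal sketch reproducing the example above:

```python
# Minimal sketch: exact match of a prediction against its reference.
def exact_match(prediction: str, reference: str) -> int:
    return int(prediction == reference)

print(exact_match("Happy Birthday!", "Happy New Year!"))  # 0
print(exact_match("Happy Birthday!", "Happy Birthday!"))  # 1
```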
Adjusted Rand Index (ARI) 2 related use cases
The Adjusted Rand Index (ARI) is a widely used metric for evaluating the similarity between two clustering assignments. It improves upon the Rand Index (RI) by correcting for chance agreement, making it a more reliable measure of clustering similarity.
Objectives:
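A minimal sketch using scikit-learn's adjusted_rand_score on toy cluster assignments:

```python
from sklearn.metrics import adjusted_rand_score

# Minimal sketch: ARI between two clustering assignments (toy labels).
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 2, 2]  # same grouping, different label names

print(adjusted_rand_score(labels_a, labels_b))  # 1.0: identical partitions
```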
Mean Per Joint Position Error (MPJPE) 2 related use cases
Mean Per Joint Position Error (MPJPE) is a common metric used to evaluate the performance of human pose estimation algorithms. It measures the average distance between the predicted joints of a human skeleton and the ground truth joints in a given dataset. ...
Objectives:
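A minimal sketch of MPJPE as the mean Euclidean distance over joints, on made-up 3D joint arrays; it omits the root-joint alignment that many evaluation protocols apply first:

```python
import numpy as np

# Minimal sketch: MPJPE over 5 joints with (x, y, z) coordinates (toy data).
pred = np.random.default_rng(0).normal(size=(5, 3))
truth = pred + 0.05  # pretend the prediction is off by a constant offset

mpjpe = np.linalg.norm(pred - truth, axis=1).mean()  # mean per-joint Euclidean distance
print(mpjpe)
```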
Out-of-distribution (OOD) generalization 1 related use case
Robustness Metrics provides lightweight modules to evaluate the robustness of classification models. Out-of-distribution (OOD) generalization is measured on inputs that a non-expert human would still be able to classify: similar objects, but with a possibly changed viewpoint, scene setting, or lighting.
Objectives:
Stability 1 related use case
Robustness Metrics provides lightweight modules to evaluate the robustness of classification models. Stability is defined as the stability of the prediction and predicted probabilities under natural perturbations of the input.
Objectives:
Metric for Evaluation of Translation with Explicit ORdering (METEOR) 1 related use case
Metric for Evaluation of Translation with Explicit ORdering (METEOR) is a machine translation evaluation metric, which is calculated based on the harmonic mean of precision and recall, with recall weighted more than precision.
METEOR is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations.
Objectives:
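A minimal sketch of METEOR's core F-mean only (unigram precision and recall combined with recall weighted 9:1, as in the original formulation); it deliberately omits the stemming, synonym matching, and fragmentation penalty of the full metric, and the sentences are toy examples:

```python
from collections import Counter

# Minimal sketch: METEOR-style weighted harmonic mean of unigram precision and recall.
hypothesis = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

matches = sum((Counter(hypothesis) & Counter(reference)).values())
precision = matches / len(hypothesis)
recall = matches / len(reference)

f_mean = (10 * precision * recall) / (recall + 9 * precision)  # recall weighted 9:1
print(f_mean)
```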
Mean Average Precision (MAP) 1 related use case
Mean Average Precision (MAP) is a metric used to evaluate object detection models such as Fast R-CNN, YOLO, and Mask R-CNN. Average precision (AP) is calculated over recall values from 0 to 1, and MAP is the mean of these AP values.
Objectives:
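A minimal sketch of 11-point interpolated average precision from made-up precision/recall pairs; a full detection MAP additionally involves IoU-based matching of predicted and ground-truth boxes and averaging the AP values over classes:

```python
import numpy as np

# Minimal sketch: 11-point interpolated AP from points on a precision-recall curve.
recall    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.9, 0.8, 0.7, 0.5, 0.3])

ap = 0.0
for r in np.linspace(0.0, 1.0, 11):          # recall levels 0.0, 0.1, ..., 1.0
    candidates = precision[recall >= r]      # precision at recall >= r
    ap += (candidates.max() if candidates.size else 0.0) / 11

print(ap)  # AP for one class; MAP averages this over all classes
```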
Spearman's rank correlation coefficient (SRCC)
In statistics, Spearman's rank correlation coefficient or Spearman's ρ is a non-parametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.
Objectives:
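A minimal sketch using SciPy's spearmanr on toy data:

```python
from scipy.stats import spearmanr

# Minimal sketch: Spearman's rank correlation between two variables (toy data).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

rho, p_value = spearmanr(x, y)
print(rho, p_value)
```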
