These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.
Equal performance 24 related use cases
If a model systematically makes errors disproportionately for patients in the protected group, it is likely to lead to unequal outcomes. Equal performance refers to the assurance that a model is equally accurate for patients in the protec...
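One way to check the idea above is to compute accuracy separately for each group and look at the gap. The sketch below is a minimal illustration, not the catalogue's reference implementation; the function names `accuracy_by_group` and `accuracy_gap` and the use of a simple max-minus-min gap are assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return a dict mapping each group label to the model's accuracy on it."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(y_true, y_pred, groups):
    """Hypothetical summary: largest minus smallest per-group accuracy."""
    per_group = accuracy_by_group(y_true, y_pred, groups)
    return max(per_group.values()) - min(per_group.values())


if __name__ == "__main__":
    y_true = [1, 0, 1, 0]
    y_pred = [1, 0, 0, 0]
    groups = ["non-protected", "non-protected", "protected", "protected"]
    print(accuracy_by_group(y_true, y_pred, groups))
    print(accuracy_gap(y_true, y_pred, groups))
```

A gap close to zero suggests the model is about equally accurate for both groups on this sample; how small a gap is acceptable is a judgment the sketch does not make.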
Recall 9 related use cases
Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation:

Recall = TP / (TP + FN)

where TP is the number of true positives and FN is the number of false negatives.
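The recall formula can be sketched directly from binary labels. The function name `recall` and the choice to return 0.0 when there are no positive examples are assumptions made for this illustration.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        # Assumption: with no positive examples, report 0.0 instead of raising.
        return 0.0
    return tp / (tp + fn)


if __name__ == "__main__":
    # Three positives in y_true; the model finds two of them.
    print(recall([1, 1, 1, 0], [1, 0, 1, 1]))
```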
Gender-based Illicit Proximity Estimate (GIPE) 4 related use cases
This paper proposes a new bias evaluation metric – Gender-based Illicit Proximity Estimate (GIPE), which measures the extent of undue proximity in word vectors resulting from the presence of gender-based predilections. Experiments based on a suite of...
F-score 2 related use cases
In statistical analysis of binary classification, the F-score or F-measure is a measure of a test's accuracy. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all...
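As a minimal sketch, the F-score can be computed from counts of true positives, false positives and false negatives. The general F-beta form is used here, with beta = 1 giving the usual F1 score (the harmonic mean of precision and recall). The function name `f_beta` and the zero-division handling are assumptions made for this example.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from confusion-matrix counts; beta=1 gives F1."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Assumption: report 0.0 when precision and recall are both zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    print(f_beta(tp=2, fp=1, fn=1))            # F1
    print(f_beta(tp=2, fp=1, fn=3, beta=2.0))  # weights recall more heavily
```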
Exact Match 2 related use cases
A given predicted string’s exact match score is 1 if it is the exact same as its reference string, and is 0 otherwise.
- Example 1: The exact match score of prediction “Happy Birthday!” is 0, given its reference is “Happy New Year!...
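The definition above can be sketched in a few lines. This version compares strings exactly as given, with no case folding or whitespace normalization (an assumption; some evaluation suites normalize first). The helper `mean_exact_match` is a hypothetical convenience for scoring a batch.

```python
def exact_match(prediction, reference):
    """Return 1 if the prediction equals the reference exactly, else 0."""
    return int(prediction == reference)


def mean_exact_match(predictions, references):
    """Hypothetical helper: average exact match score over paired lists."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    scores = [exact_match(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(exact_match("Happy Birthday!", "Happy New Year!"))
    print(mean_exact_match(["a", "b"], ["a", "c"]))
```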
Equality of Opportunity Difference (EOD) 1 related use case
We propose a criterion for discrimination against a specified sensitive attribute in supervised learning, where the goal is to predict some target based on available features. Assuming data about the predictor, target, and membership in the protected group...
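Equality of opportunity is commonly checked by comparing true positive rates between groups. The sketch below assumes the difference is taken as the protected group's rate minus the other group's rate; sign conventions vary between toolkits, and the function names are hypothetical.

```python
def true_positive_rate(y_true, y_pred):
    """TPR = TP / (TP + FN) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("TPR is undefined without positive examples")
    return tp / (tp + fn)


def equal_opportunity_difference(y_true, y_pred, protected):
    """TPR of the protected group minus TPR of everyone else (assumed sign)."""
    rows = list(zip(y_true, y_pred, protected))
    prot_true = [t for t, _, g in rows if g]
    prot_pred = [p for _, p, g in rows if g]
    rest_true = [t for t, _, g in rows if not g]
    rest_pred = [p for _, p, g in rows if not g]
    return (true_positive_rate(prot_true, prot_pred)
            - true_positive_rate(rest_true, rest_pred))


if __name__ == "__main__":
    y_true = [1, 1, 1, 1, 0, 0]
    y_pred = [1, 0, 1, 1, 0, 1]
    protected = [True, True, False, False, True, False]
    print(equal_opportunity_difference(y_true, y_pred, protected))
```

A value of zero means both groups have the same true positive rate on this sample; a negative value means qualified members of the protected group are recognized less often.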
Translation Edit Rate (TER) 1 related use case
Translation Edit Rate (TER), also called Translation Error Rate, is a metric to quantify the edit operations that a hypothesis requires to match a reference translation.
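The sketch below is a simplified illustration only. Full TER also counts block shifts of word sequences as single edits; this version omits shifts and computes word-level edit distance (insertions, deletions, substitutions) divided by the reference length. The function names are hypothetical.

```python
def word_edit_distance(hyp_words, ref_words):
    """Levenshtein distance over word lists (no shift operations)."""
    prev = list(range(len(ref_words) + 1))
    for i, hyp_word in enumerate(hyp_words, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref_words, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simplified_ter(hypothesis, reference):
    """Edits needed to turn the hypothesis into the reference, per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hypothesis.split(), ref_words) / len(ref_words)


if __name__ == "__main__":
    print(simplified_ter("the cat sat", "the cat sat down"))
```

Lower is better: 0 means the hypothesis already matches the reference word for word.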
Statistical Parity Difference (SPD) 1 related use case
We study fairness in classification, where individuals are classified, e.g., admitted to a university, and the goal is to prevent discrimination against individuals based on their membership in some group, while maintaining utility for the classifier (the ...
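Statistical parity is usually assessed by comparing the rate of favorable predictions between groups. The sketch below assumes the difference is the protected group's positive-prediction rate minus the other group's rate; the sign convention and function names are assumptions made for this example.

```python
def positive_rate(y_pred):
    """Share of predictions equal to the favorable outcome (coded as 1)."""
    if not y_pred:
        raise ValueError("group has no members")
    return sum(1 for p in y_pred if p == 1) / len(y_pred)


def statistical_parity_difference(y_pred, protected):
    """Positive rate of the protected group minus that of everyone else."""
    prot = [p for p, g in zip(y_pred, protected) if g]
    rest = [p for p, g in zip(y_pred, protected) if not g]
    return positive_rate(prot) - positive_rate(rest)


if __name__ == "__main__":
    y_pred = [1, 0, 0, 1, 1, 1]
    protected = [True, True, True, False, False, False]
    print(statistical_parity_difference(y_pred, protected))
```

Note that this measure uses predictions only and ignores the true labels, which is what distinguishes it from error-rate-based criteria.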
WinoST
The scientific community is increasingly aware of the necessity to embrace pluralism and consistently represent major and minor social groups. Currently, there are no standard evaluation techniques for different types of biases. Accordingly, there is an urg...
SVEva Fair
Despite the success of deep neural networks (DNNs) in enabling on-device voice assistants, increasing evidence of bias and discrimination in machine learning is raising the urgency of investigating the fairness of these systems. Speaker verification is a fo...
SAFE Artificial Intelligence in finance
We propose a set of interrelated metrics, all based on the notion of AI output concentration, and the related Lorenz curve/Lorenz area under the curve, able to measure the Sustainability/robustness, Accuracy, Fairness/privacy, Explainability/accountability ...
Conditional Demographic Disparity (CDD)
The demographic disparity metric (DD) determines whether a facet has a larger proportion of the rejected outcomes in the dataset than of the accepted outcomes. In the binary case where there are two facets, men and women for example, that constitute the dat...
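As a sketch of the DD idea described above: for a given facet, take its share of rejected outcomes minus its share of accepted outcomes. The conditional version shown here assumes the commonly used form, a size-weighted average of DD over strata of a conditioning attribute. Function names, the 0/1 outcome coding, and the handling of strata with no rejections or no acceptances are assumptions made for this illustration.

```python
from collections import defaultdict


def demographic_disparity(outcomes, facets, facet):
    """Facet's share of rejections (0) minus its share of acceptances (1)."""
    n_rejected = sum(1 for o in outcomes if o == 0)
    n_accepted = sum(1 for o in outcomes if o == 1)
    if n_rejected == 0 or n_accepted == 0:
        # Assumption: treat DD as 0 when one outcome class is empty.
        return 0.0
    rejected_in_facet = sum(1 for o, f in zip(outcomes, facets) if o == 0 and f == facet)
    accepted_in_facet = sum(1 for o, f in zip(outcomes, facets) if o == 1 and f == facet)
    return rejected_in_facet / n_rejected - accepted_in_facet / n_accepted


def conditional_demographic_disparity(outcomes, facets, strata, facet):
    """Size-weighted average of DD computed within each stratum."""
    by_stratum = defaultdict(lambda: ([], []))
    for outcome, f, stratum in zip(outcomes, facets, strata):
        by_stratum[stratum][0].append(outcome)
        by_stratum[stratum][1].append(f)
    total = len(outcomes)
    return sum(len(outs) * demographic_disparity(outs, fs, facet)
               for outs, fs in by_stratum.values()) / total


if __name__ == "__main__":
    outcomes = [0, 0, 1, 1, 0, 1]
    facets = ["w", "w", "m", "m", "m", "w"]
    strata = ["A", "A", "A", "A", "B", "B"]
    print(demographic_disparity(outcomes, facets, "w"))
    print(conditional_demographic_disparity(outcomes, facets, strata, "w"))
```

Conditioning matters because an aggregate disparity can shrink, grow, or reverse once the data is split by a relevant attribute.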
Rank-Aware Divergence (RADio)
Data Banzhaf
Data Shapley
Predictions Groups Contrast (PGC)
The PGC metric compares the top-K ranking of feature importance drawn from the entire dataset with the top-K ranking induced from specific subgroups of predictions. It can be applied to both categorical and regression problems, being useful for quantifying...
SAFE (Sustainable, Accurate, Fair and Explainable)
Machine learning models, at the core of AI applications, typically achieve high accuracy at the expense of explainability. Moreover, according to the proposed regulations, AI applications based on machine learning must be "trus...
Cohen's Kappa coefficient
Cohen's kappa coefficient is a statistic that is used to measure inter-rater reliability (and also intra-rater reliability) for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation, a...
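Cohen's kappa corrects observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e the chance agreement derived from each rater's label frequencies. A minimal sketch follows; the function name `cohens_kappa` and the decision to raise an error when p_e equals 1 are choices made for this example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

In this example the raters agree on 3 of 4 items (p_o = 0.75) while chance alone would give 0.5, so kappa is lower than the raw agreement rate.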
Spearman's rank correlation coefficient (SRCC)
In statistics, Spearman's rank correlation coefficient or Spearman's ρ is a non-parametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be describ...
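Spearman's coefficient can be computed as the Pearson correlation of the ranks of the two variables, with tied values given the average of the ranks they span. The sketch below uses only the standard library; the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Monotonic but non-linear relationship: ranks agree perfectly.
    print(spearman_rho([1, 2, 3, 4], [1, 8, 27, 64]))
    print(spearman_rho([1, 2, 3], [1, 3, 2]))
```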
Equal outcomes
In the field of health, equal patient outcomes refers to the assurance that protected groups have equal benefit in terms of patient outcomes from the deployment of machine-learning models. A weak form of equal outcomes is ensuring that both the protect...
