These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.
Stability
Robustness Metrics provides lightweight modules for evaluating the robustness of classification models. Stability here refers to, for example, how stable the predictions and predicted probabilities remain under natural perturbations of the input.
The li...
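A minimal sketch of one such stability check, assuming a `predict` function that maps a batch of inputs to class labels (the function names and Gaussian noise model below are illustrative, not the library's API):

```python
import numpy as np

def prediction_stability(predict, x, noise_scale=0.01, n_trials=20, seed=0):
    """Fraction of small random perturbations of `x` that leave the
    predicted label unchanged -- a simple proxy for stability under
    natural input perturbation."""
    rng = np.random.default_rng(seed)
    base = predict(x[None, :])[0]  # label on the clean input
    perturbed = x[None, :] + noise_scale * rng.standard_normal((n_trials, x.size))
    return np.mean(predict(perturbed) == base)

# Toy linear classifier: label 1 if the feature sum is positive.
clf = lambda X: (X.sum(axis=1) > 0).astype(int)
# Input well inside class 1, so tiny perturbations never flip it.
print(prediction_stability(clf, np.array([5.0, -1.0])))  # 1.0
```

A score near 1.0 indicates the model's decision is locally stable; lower scores flag inputs near the decision boundary.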
Translation Edit Rate (TER)
Translation Edit Rate (TER), also called Translation Error Rate, quantifies the number of edit operations (insertions, deletions, substitutions, and shifts of word sequences) required to change a hypothesis so that it exactly matches a reference translation, normalized by the length of the reference.
Trustworthy AI Relevance
This metric addresses Transpare...
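At its core, TER is an edit distance divided by the reference length. The sketch below is a simplification: it counts insertions, deletions, and substitutions via standard Levenshtein distance and omits the phrase-shift operation of full TER:

```python
def ter(hypothesis: list[str], reference: list[str]) -> float:
    """Simplified TER: word-level edit distance (insertions, deletions,
    substitutions) divided by reference length. Full TER also counts
    block shifts of word sequences, omitted here for brevity."""
    m, n = len(hypothesis), len(reference)
    # Standard Levenshtein dynamic-programming table.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(n, 1)

# One substitution against a four-word reference -> 1/4.
print(ter("the cat sat down".split(), "the cat sat up".split()))  # 0.25
```

Lower scores are better; a TER of 0 means the hypothesis already matches the reference.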
Equal outcomes
In the field of health, equal patient outcomes refers to the assurance that protected groups have equal benefit in terms of patient outcomes from the deployment of machine-learning models. A weak form of equal outcomes is ensuring that both the protect...
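As a rough illustration, one weak check compares average patient outcomes across protected groups (the function name and group labels below are hypothetical):

```python
def outcome_gap(outcomes, groups):
    """Difference between the best and worst average outcome across
    protected groups. A gap of 0 means equal outcomes under this
    deliberately simple notion of equal group benefit."""
    by_group = {}
    for y, g in zip(outcomes, groups):
        by_group.setdefault(g, []).append(y)
    rates = {g: sum(v) / len(v) for g, v in by_group.items()}
    return max(rates.values()) - min(rates.values())

# 1 = positive outcome; group "a" benefits at 0.75, group "b" at 0.50.
print(outcome_gap([1, 1, 1, 0, 1, 0, 1, 0],
                  ["a", "a", "a", "a", "b", "b", "b", "b"]))  # 0.25
```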
Natural Image Quality Evaluator (NIQE)
Natural Image Quality Evaluator (NIQE) computes a no-reference image quality score for images that may be distorted or of low perceptual quality. A key contribution of this metric is that it does not require a known class of image distortions or percept...
Context Recall
Context Recall assesses how effectively a model retrieves all relevant pieces of information necessary to generate a comprehensive and accurate response. Unlike precision, which focuses on relevance, recall emphasizes completeness, ensuring that no critical...
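A set-based simplification can illustrate the idea. Production implementations in RAG evaluation frameworks typically use an LLM judge to attribute claims to the retrieved context; the sketch below replaces that judgment with exact set membership:

```python
def context_recall(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of all relevant evidence items that appear in the
    retrieved context. 1.0 means nothing critical was missed."""
    if not relevant:
        return 1.0  # nothing required, so nothing can be missing
    return len(retrieved & relevant) / len(relevant)

# Hypothetical evidence identifiers for a single query.
relevant = {"fact_a", "fact_b", "fact_c", "fact_d"}
retrieved = {"fact_a", "fact_b", "fact_c", "noise_x"}
print(context_recall(retrieved, relevant))  # 0.75 -- fact_d was missed
```

Note how the irrelevant `noise_x` does not lower recall; completeness, not precision, is what this metric measures.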
Response Relevancy
Response Relevancy evaluates how closely the generated answer aligns with the input query. This metric assigns a higher score to answers that directly and completely address the question, while penalizing answers that are incomplete or contain redundant inf...
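Implementations typically compare embeddings of the generated answer and the input query; the sketch below substitutes a crude bag-of-words cosine similarity purely to illustrate the scoring direction (higher for answers aligned with the query):

```python
import math
from collections import Counter

def relevancy(answer: str, query: str) -> float:
    """Cosine similarity between bag-of-words vectors of the answer
    and the query -- a stand-in for the embedding-based similarity
    that real implementations use."""
    a, q = Counter(answer.lower().split()), Counter(query.lower().split())
    dot = sum(a[w] * q[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nq = math.sqrt(sum(v * v for v in q.values()))
    return dot / (na * nq) if na and nq else 0.0

# An on-topic answer scores high; an unrelated one scores 0.
print(relevancy("paris is the capital of france",
                "what is the capital of france"))
print(relevancy("bananas are yellow",
                "what is the capital of france"))
```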
Agent Goal Accuracy
Agent Goal Accuracy is a metric used to evaluate the effectiveness of a language model in accurately identifying and achieving a user’s intended goals during an interaction. This binary metric assigns a score of 1 if the AI successfully accomplishes the use...
TrueTeacher
TrueTeacher is a model-based metric designed to evaluate the factual consistency of generated summaries by comparing them against the original text. It utilizes a T5-11B model fine-tuned on a synthetic dataset specifically curated for consistency evaluation...
JobFair (A Framework for Benchmarking Gender Hiring Bias in Large Language Models)
JobFair is a robust framework designed to benchmark and evaluate hierarchical gender hiring biases in Large Language Models (LLMs) used for resume scoring. It identifies and quantifies two primary types of bias: Level Bias (differences in a...
Reject Rate (RR)
The Reject Rate is a metric used to evaluate the frequency at which a large language model (LLM) refuses to provide a response to a query. It is particularly relevant in scenarios where refusal is expected to mitigate risks associated with unsafe, biased, o...
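The computation itself is a simple ratio; the refusal classifier below is a hypothetical keyword check standing in for the human or model judgment used in practice:

```python
def reject_rate(responses, is_refusal):
    """Share of model responses classified as refusals."""
    refusals = sum(1 for r in responses if is_refusal(r))
    return refusals / len(responses)

# Illustrative keyword-based refusal detector (real evaluations use
# human annotators or a classifier model).
keyword_refusal = lambda r: any(
    p in r.lower() for p in ("i can't", "i cannot", "i won't"))

responses = [
    "Sure, here is the answer.",
    "I cannot help with that.",
    "I can't assist with this request.",
    "Here you go.",
]
print(reject_rate(responses, keyword_refusal))  # 0.5
```

Whether a given rate is good depends on the query set: high on unsafe prompts signals appropriate caution, while high on benign prompts signals over-refusal.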