Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.


If you have a tool that you think should be featured in the Catalogue of AI Tools & Metrics, we would love to hear from you!

This page includes technical metrics and methodologies for measuring and evaluating AI trustworthiness and AI risks. These metrics are often represented through mathematical formulas that assess the technical requirements for achieving trustworthy AI in a particular context. They can help to ensure that a system is fair, accurate, explainable, transparent, robust, safe, or secure.

Robustness Metrics provides lightweight modules to evaluate the robustness of classification models. Robustness is defined as, e.g., the stability of the prediction and predicted probabilities under natural perturbation of the input.

The li...
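As an illustrative sketch of this notion of stability (not the library's own API; the model interface and the Gaussian-noise perturbation are assumptions), one can measure how often a classifier's prediction survives small random perturbations of an input:

    import numpy as np

    def prediction_stability(model, x, noise_scale=0.01, n_trials=20):
        """Fraction of perturbed copies of x that keep the clean prediction.

        Assumes `model.predict_proba(batch)` returns an array of shape
        (n, n_classes); Gaussian noise stands in for a 'natural'
        perturbation of the input.
        """
        rng = np.random.default_rng(0)
        clean_label = np.argmax(model.predict_proba(x[None])[0])
        noisy = x[None] + rng.normal(0.0, noise_scale, size=(n_trials, *x.shape))
        noisy_labels = np.argmax(model.predict_proba(noisy), axis=1)
        return float(np.mean(noisy_labels == clean_label))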


Translation Edit Rate (TER), also called Translation Error Rate, is a metric that quantifies the edit operations a hypothesis requires to match a reference translation.

Trustworthy AI Relevance

This metric addresses Transparency...
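As a minimal sketch of the idea, the shift-free variant below divides the word-level edit distance (insertions, deletions and substitutions) by the reference length; full TER additionally counts block shifts of word sequences as single edits, so this simplification is an upper bound:

    def simple_ter(hypothesis, reference):
        """Word-level edit distance divided by reference length.

        Full TER also counts block shifts as single edits; this
        shift-free variant can only over-count, so it is an upper bound.
        """
        hyp, ref = hypothesis.split(), reference.split()
        # Levenshtein distance by dynamic programming.
        d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
        for i in range(len(hyp) + 1):
            d[i][0] = i
        for j in range(len(ref) + 1):
            d[0][j] = j
        for i in range(1, len(hyp) + 1):
            for j in range(1, len(ref) + 1):
                cost = 0 if hyp[i - 1] == ref[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
        return d[-1][-1] / max(len(ref), 1)

    print(simple_ter("the cat sat", "the cat sat on the mat"))  # 0.5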


In the field of health, equal patient outcomes refers to the assurance that protected groups benefit equally, in terms of patient outcomes, from the deployment of machine-learning models. A weak form of equal outcomes is ensuring that both the protect...
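A minimal sketch of one such check, assuming the true-positive rate stands in for "patient benefit" (the appropriate outcome measure is context-dependent):

    import numpy as np

    def outcome_gap(y_true, y_pred, group, protected_value):
        """Gap in true-positive rate between the protected group and
        everyone else; zero means equal outcomes on this measure."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        group = np.asarray(group)

        def tpr(mask):
            pos = (y_true == 1) & mask
            return (y_pred[pos] == 1).mean() if pos.any() else np.nan

        return tpr(group == protected_value) - tpr(group != protected_value)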


Natural Image Quality Evaluator (NIQE) calculates a no-reference image quality score for images that may be distorted or of low perceptual quality. The contribution of this metric derives from not requiring a known class of image distortions or percept...
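Under the hood, NIQE fits a multivariate Gaussian to natural-scene-statistics features of image patches and scores a test image by its distance from the same fit on pristine images. The sketch below computes only that final distance, assuming the feature-extraction stage has already produced the means and covariances:

    import numpy as np

    def niqe_distance(mu_pristine, cov_pristine, mu_test, cov_test):
        """Final NIQE score: distance between the multivariate-Gaussian
        fit of pristine natural-scene-statistics features and the fit on
        the test image's features. Lower values mean better quality."""
        diff = np.asarray(mu_pristine) - np.asarray(mu_test)
        pooled = (np.asarray(cov_pristine) + np.asarray(cov_test)) / 2.0
        return float(np.sqrt(diff @ np.linalg.pinv(pooled) @ diff))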


Context Recall assesses how effectively a model retrieves all relevant pieces of information necessary to generate a comprehensive and accurate response. Unlike precision, which focuses on relevance, recall emphasizes completeness, ensuring that no critical...
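In its simplest set-based form (treating document IDs as the unit of relevance is an assumption; LLM-based variants instead judge whether each reference claim is supported by the retrieved context), context recall reduces to:

    def context_recall(relevant_ids, retrieved_ids):
        """|relevant ∩ retrieved| / |relevant|: completeness of retrieval."""
        relevant, retrieved = set(relevant_ids), set(retrieved_ids)
        return len(relevant & retrieved) / len(relevant) if relevant else 0.0

    print(context_recall({"d1", "d2", "d3"}, {"d2", "d3", "d9"}))  # ~0.667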


Response Relevancy evaluates how closely the generated answer aligns with the input query. This metric assigns a higher score to answers that directly and completely address the question, while penalizing answers that are incomplete or contain redundant inf...
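One common formulation (used, e.g., by the Ragas library) asks an LLM to re-generate plausible questions from the answer and scores the mean embedding similarity to the original question. The sketch below assumes the embedding and question-generation steps have already run:

    import numpy as np

    def response_relevancy(question_vec, generated_question_vecs):
        """Mean cosine similarity between the original question and the
        questions re-generated from the answer: a complete, on-topic
        answer yields re-generated questions close to the original."""
        q = np.asarray(question_vec)
        q = q / np.linalg.norm(q)
        sims = [np.dot(g, q) / np.linalg.norm(g) for g in generated_question_vecs]
        return float(np.mean(sims))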


Agent Goal Accuracy is a metric used to evaluate the effectiveness of a language model in accurately identifying and achieving a user’s intended goals during an interaction. This binary metric assigns a score of 1 if the AI successfully accomplishes the use...
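Because the metric is binary per interaction, the aggregate score is simply the mean of 0/1 judgments; in the sketch below those per-interaction judgments are assumed to come from a human or LLM judge:

    def agent_goal_accuracy(judgments):
        """Mean of binary per-interaction scores (1 = the user's goal was
        achieved, 0 = it was not)."""
        return sum(judgments) / len(judgments) if judgments else 0.0

    print(agent_goal_accuracy([1, 0, 1, 1]))  # 0.75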


TrueTeacher is a model-based metric designed to evaluate the factual consistency of generated summaries by comparing them against the original text. It utilizes a T5-11B model fine-tuned on a synthetic dataset specifically curated for consistency evaluation...
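A sketch of scoring with the published checkpoint might look as follows; the checkpoint name and the premise/hypothesis prompt format are assumptions drawn from the TrueTeacher release on Hugging Face and should be verified against its model card:

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    # Checkpoint name assumed from the TrueTeacher release; at 11B
    # parameters it requires a large GPU or weight offloading.
    MODEL = "google/t5_11b_trueteacher_and_anli"
    tokenizer = T5Tokenizer.from_pretrained(MODEL)
    model = T5ForConditionalGeneration.from_pretrained(MODEL)

    def factually_consistent(document, summary):
        """NLI-style check: the model emits '1' if the summary is judged
        factually consistent with the source document, '0' otherwise."""
        inputs = tokenizer(f"premise: {document} hypothesis: {summary}",
                           return_tensors="pt", truncation=True)
        outputs = model.generate(**inputs, max_new_tokens=2)
        return tokenizer.decode(outputs[0], skip_special_tokens=True) == "1"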


JobFair is a robust framework designed to benchmark and evaluate hierarchical gender hiring biases in Large Language Models (LLMs) used for resume scoring. It identifies and quantifies two primary types of bias: Level Bias (differences in a...
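One simplified reading of Level Bias computes a plain mean difference over paired counterfactual resumes (the full framework applies statistical tests rather than this bare gap):

    import numpy as np

    def level_bias(scores_variant_a, scores_variant_b):
        """Mean score gap between paired counterfactual resumes that
        differ only in the gender signal; values near zero suggest no
        level bias in this simplified form."""
        a, b = np.asarray(scores_variant_a), np.asarray(scores_variant_b)
        return float(np.mean(a - b))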


The Reject Rate is a metric used to evaluate the frequency at which a large language model (LLM) refuses to provide a response to a query. It is particularly relevant in scenarios where refusal is expected to mitigate risks associated with unsafe, biased, o...
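A minimal sketch of the computation; the keyword heuristic for detecting refusals is an assumption, and real evaluations typically rely on an LLM judge or a trained refusal classifier:

    REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

    def reject_rate(responses):
        """Share of responses flagged as refusals by a keyword heuristic."""
        flagged = sum(any(m in r.lower() for m in REFUSAL_MARKERS)
                      for r in responses)
        return flagged / len(responses) if responses else 0.0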



Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.