Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

Robustness & digital security

Clear all

Scope

SUBMIT A METRIC

If you have a tool that you think should be featured in the Catalogue of AI Tools & Metrics, we would love to hear from you!

SUBMIT
This page includes technical metrics and methodologies for measuring and evaluating AI trustworthiness and AI risks. These metrics are often represented through mathematical formulas that assess the technical requirements for achieving trustworthy AI in a particular context. They can help to ensure that a system is fair, accurate, explainable, transparent, robust, safe, or secure.
Objective Robustness & digital security

The most general time-based metric measures the time until the adversary’s success. It assumes that the adversary will succeed eventually, and is therefore an example of a pessimistic metric. This metric relies on a definition of success, and varies depend...


Robustness Metrics provides lightweight modules in order to evaluate the robustness of classification models. Stability is defined as, e.g. the stability of the prediction and predicted probabilities under natural perturbation of the input.

The l...


Robustness Metrics provides lightweight modules in order to evaluate the robustness of classification models. OOD generalization is defined as, e.g. a non-expert human would be able to classify similar objects, but possibly changed viewpoint, scene setting...


Tool Call Accuracy evaluates the effectiveness of a language model (LLM) in accurately identifying and invoking the necessary tools to accomplish a specified task. This metric is essential for assessing the model’s capability to select and utilize appropria...


Topic Adherence evaluates an AI system’s ability to confine its responses to predefined subject areas during interactions. This metric is crucial in applications where the AI is expected to assist only within specific domains, ensuring that responses remain...


Faithfulness is a metric that assesses the factual consistency of the model’s generated response with respect to the provided context. This metric ensures that every claim made in the answer can be supported or inferred from the context. The score ranges fr...


Response Relevancy evaluates how closely the generated answer aligns with the input query. This metric assigns a higher score to answers that directly and completely address the question, while penalizing answers that are incomplete or contain redundant inf...


Context Entities Recall measures the recall of entities in retrieved contexts based on the entities present in both the reference and retrieved contexts, relative to the entities in the reference alone. This metric evaluates what fraction of entities in the...


Noise Sensitivity measures how susceptible a language model is to making errors when exposed to irrelevant or noisy information in the context. Specifically, it evaluates the likelihood of the model generating incorrect responses due to both relevant and ir...


Context Recall assesses how effectively a model retrieves all relevant pieces of information necessary to generate a comprehensive and accurate response. Unlike precision, which focuses on relevance, recall emphasizes completeness, ensuring that no critical...


Context Precision is a metric that quantifies the accuracy of relevant information in the retrieved contexts used by language models. This metric is particularly significant in Retrieval-Augmented Generation (RAG) settings, where precise retrieval of contex...


We propose a set of interrelated metrics, all based on the notion of AI output concentration, and the related Lorenz curve/Lorenz area under the curve, able to measure the Sustainability/robustness, Accuracy, Fairness/privacy, Explainability/accountability ...


Ideally we would like to obtain a more complete understanding of variable importance for the set of models that predict almost equally well. This set of almost-equally-accurate predictive models is called the Rashomon set; it is the set of models with training...

CLIPBERTSCORE is a simple weighted combination of CLIPScore (Hessel et al., 2021) and BERTScore (Zhang* et al., 2020) to leverage the robustness and strong factuality detection performance between image-summary and document-summary, respectively.

Tree Edit Distance (TED) is a metric for calculation of similarity between syntactic n-grams for further detection of soft similarity between texts.


In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient, is a statistic used to measure the ordinal association between two measured quantities. A τ test is a non-parametric hypothesis test for statistical de...


Multi-object tracking accuracy (MOTA) shows how many errors the tracker system has made in terms of misses, false positives, mismatch errors, etc. Therefore, it can be derived from three error ratios: the ratio of misses, the ratio of false positives, and t...


The structural similarity index measure (SSIM) measures the perceived similarity of two images. When one image is a modified version of the other (e.g., if it is compressed) the SSIM serves as a measure of the fidelity of the compressed representation. The ...


False rejection rate (FRR) is a security metric used to measure the performance of biometric systems such as voice recognition, fingerprint recognition, face recognition, or iris recognition. It represents the likelihood of a biometric system mistakenly rej...


False acceptance rate (FAR) is a security metric used to measure the performance of biometric systems such as voice recognition, fingerprint recognition, face recognition, or iris recognition. It represents the likelihood of a biometric system mistakenly ac...


catalogue Logos

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.