These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.
Time until Adversary’s Success 12 related use cases
The most general time-based metric measures the time until the adversary’s success. It assumes that the adversary will succeed eventually, and is therefore an example of a pessimistic metric. This metric relies on a definition of success, and varies depend...
Objectives:
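As a minimal sketch of the idea, assume each adversarial attempt is logged as a pair of elapsed time and a success flag. The function name, the log format and the rule for what counts as success are illustrative assumptions, since the metric depends on how success is defined.

```python
from typing import Iterable, Optional, Tuple


def time_until_success(attempts: Iterable[Tuple[float, bool]]) -> Optional[float]:
    """Return the elapsed time of the earliest successful attempt.

    `attempts` is a hypothetical log of (elapsed_time, succeeded) pairs.
    Returns None if no attempt in the log succeeded.
    """
    # Sort by elapsed time so the earliest success is found first.
    for elapsed, succeeded in sorted(attempts):
        if succeeded:
            return elapsed
    return None


log = [(20.0, True), (5.0, False), (12.5, True)]
first_success = time_until_success(log)
```

A log with no successful attempt returns None here, which departs from the pessimistic assumption that the adversary always succeeds eventually; such a run is better read as "not yet observed".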
Stability 1 related use case
Robustness Metrics provides lightweight modules for evaluating the robustness of classification models. Stability refers to, for example, how stable the prediction and the predicted probabilities remain under natural perturbations of the input.
The l...
Objectives:
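One simple way to quantify prediction stability is the share of inputs whose predicted label is unchanged after perturbation. The sketch below is an illustration under that assumption and does not reproduce the Robustness Metrics library's API; the function name is hypothetical.

```python
from typing import Sequence


def prediction_stability(clean_preds: Sequence[int],
                         perturbed_preds: Sequence[int]) -> float:
    """Fraction of inputs whose predicted label is unchanged by perturbation."""
    if len(clean_preds) != len(perturbed_preds):
        raise ValueError("prediction lists must have the same length")
    if not clean_preds:
        raise ValueError("at least one prediction is required")
    unchanged = sum(c == p for c, p in zip(clean_preds, perturbed_preds))
    return unchanged / len(clean_preds)


# Labels predicted on clean inputs and on their perturbed counterparts.
stability = prediction_stability([1, 0, 2, 1], [1, 0, 1, 1])
```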
Out-of-distribution (OOD) generalization 1 related use case
Robustness Metrics provides lightweight modules for evaluating the robustness of classification models. OOD generalization covers, for example, cases where a non-expert human would be able to classify similar objects, but possibly with changed viewpoint, scene setting...
Objectives:
Tool Call Accuracy
Tool Call Accuracy evaluates the effectiveness of a large language model (LLM) in accurately identifying and invoking the necessary tools to accomplish a specified task. This metric is essential for assessing the model’s capability to select and utilize appropria...
Objectives:
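A rough sketch of how such a score could be computed, assuming each tool call is recorded as a tool name plus an argument dictionary and that calls are compared position by position against a reference sequence. The exact-match rule and the function name are assumptions made for illustration, not the scoring rule of any particular evaluation library.

```python
from typing import Any, Dict, Sequence, Tuple

ToolCall = Tuple[str, Dict[str, Any]]  # (tool name, arguments)


def tool_call_accuracy(predicted: Sequence[ToolCall],
                       reference: Sequence[ToolCall]) -> float:
    """Fraction of reference calls matched exactly, in order, by the model."""
    if not reference:
        raise ValueError("reference must contain at least one tool call")
    # zip stops at the shorter sequence, so missing calls count as misses.
    matches = sum(1 for p, r in zip(predicted, reference) if p == r)
    return matches / len(reference)


reference_calls = [("search", {"q": "weather"}), ("calc", {"expr": "2+2"})]
predicted_calls = [("search", {"q": "weather"}), ("calc", {"expr": "2+3"})]
accuracy = tool_call_accuracy(predicted_calls, reference_calls)
```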
Topic Adherence
Topic Adherence evaluates an AI system’s ability to confine its responses to predefined subject areas during interactions. This metric is crucial in applications where the AI is expected to assist only within specific domains, ensuring that responses remain...
Objectives:
Faithfulness
Faithfulness is a metric that assesses the factual consistency of the model’s generated response with respect to the provided context. This metric ensures that every claim made in the answer can be supported or inferred from the context. The score ranges fr...
Objectives:
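The idea that every claim in the answer should be supported by the context can be sketched as a ratio of supported claims to all claims. The sketch assumes the claims have already been extracted and judged, for instance by a human or an LLM judge; that verification step and the function name are assumptions and are not shown.

```python
from typing import Sequence


def faithfulness_score(claim_supported: Sequence[bool]) -> float:
    """Share of answer claims judged to be supported by the context.

    `claim_supported` holds one verdict per claim, produced by a separate
    (hypothetical) verification step.
    """
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


# Four claims extracted from an answer; the third is not supported.
score = faithfulness_score([True, True, False, True])
```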
Response Relevancy
Response Relevancy evaluates how closely the generated answer aligns with the input query. This metric assigns a higher score to answers that directly and completely address the question, while penalizing answers that are incomplete or contain redundant inf...
Objectives:
Context Entities Recall
Context Entities Recall measures the recall of entities in retrieved contexts based on the entities present in both the reference and retrieved contexts, relative to the entities in the reference alone. This metric evaluates what fraction of entities in the...
Objectives:
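The description above corresponds to a set ratio: the entities shared by the reference and the retrieved contexts, divided by the entities in the reference. The sketch below assumes entity extraction has already been done and that entities are compared as exact strings; both are simplifying assumptions.

```python
from typing import Set


def context_entities_recall(reference_entities: Set[str],
                            context_entities: Set[str]) -> float:
    """|reference ∩ context| / |reference| over pre-extracted entity sets."""
    if not reference_entities:
        raise ValueError("reference must contain at least one entity")
    shared = reference_entities & context_entities
    return len(shared) / len(reference_entities)


reference = {"Paris", "France", "Seine"}
retrieved = {"Paris", "Seine", "Louvre"}
recall = context_entities_recall(reference, retrieved)
```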
Noise Sensitivity
Noise Sensitivity measures how susceptible a language model is to making errors when exposed to irrelevant or noisy information in the context. Specifically, it evaluates the likelihood of the model generating incorrect responses due to both relevant and ir...
Objectives:
Context Recall
Context Recall assesses how effectively a model retrieves all relevant pieces of information necessary to generate a comprehensive and accurate response. Unlike precision, which focuses on relevance, recall emphasizes completeness, ensuring that no critical...
Objectives:
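In its classic information-retrieval form, recall is the share of relevant items that were actually retrieved. The sketch below uses that form with hypothetical document identifiers; how relevance is judged in a given evaluation setup is left as an assumption.

```python
from typing import Set


def context_recall(retrieved_ids: Set[str], relevant_ids: Set[str]) -> float:
    """Share of relevant items that appear among the retrieved items."""
    if not relevant_ids:
        raise ValueError("at least one relevant item is required")
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)


# Two of the four relevant documents were retrieved.
recall = context_recall({"d1", "d2", "d5"}, {"d1", "d2", "d3", "d4"})
```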
Context Precision
Context Precision is a metric that quantifies the accuracy of relevant information in the retrieved contexts used by language models. This metric is particularly significant in Retrieval-Augmented Generation (RAG) settings, where precise retrieval of contex...
Objectives:
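A simplified view of precision in retrieval is the share of retrieved items that are relevant. The sketch below uses this unranked form with hypothetical document identifiers; ranking-aware variants exist and are not shown here.

```python
from typing import Set


def context_precision(retrieved_ids: Set[str], relevant_ids: Set[str]) -> float:
    """Share of retrieved items that are relevant (unranked precision)."""
    if not retrieved_ids:
        raise ValueError("at least one retrieved item is required")
    return len(retrieved_ids & relevant_ids) / len(retrieved_ids)


# Two of the three retrieved documents are relevant.
precision = context_precision({"d1", "d2", "d5"}, {"d1", "d2", "d3", "d4"})
```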
SAFE Artificial Intelligence in finance
We propose a set of interrelated metrics, all based on the notion of AI output concentration, and the related Lorenz curve/Lorenz area under the curve, able to measure the Sustainability/robustness, Accuracy, Fairness/privacy, Explainability/accountability ...
Objectives:
Variable Importance Cloud (VIC)
Objectives:
CLIPSBERTScore
Objectives:
Tree Edit Distance (TED)
Tree Edit Distance (TED) is a metric for calculating the similarity between syntactic n-grams, which can then be used to detect soft similarity between texts.
Objectives:
Kendall rank correlation coefficient (KRCC)
In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient, is a statistic used to measure the ordinal association between two measured quantities. A τ test is a non-parametric hypothesis test for statistical de...
Objectives:
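For intuition, the simplest variant of the coefficient (often called τ-a) counts concordant and discordant pairs and divides their difference by the total number of pairs. The sketch below implements that variant only and applies no correction for ties, which is a simplifying assumption.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether the pair is ordered
        # the same way in both variables; ties contribute zero.
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


tau = kendall_tau_a([1, 2, 3], [1, 3, 2])
```

In practice, a library routine with tie handling and a significance test is usually preferable to a hand-rolled version like this one.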
Multi-Object Tracking Accuracy (MOTA)
Multi-object tracking accuracy (MOTA) shows how many errors the tracker system has made in terms of misses, false positives, mismatch errors, etc. Therefore, it can be derived from three error ratios: the ratio of misses, the ratio of false positives, and t...
Objectives:
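A common formulation of MOTA sums misses, false positives and identity mismatches over all frames, divides by the total number of ground-truth objects, and subtracts the result from 1. The sketch below follows that formulation; the per-frame tuple layout is an assumption chosen for illustration.

```python
from typing import Iterable, Tuple

# (misses, false_positives, mismatches, ground_truth_objects) per frame
FrameCounts = Tuple[int, int, int, int]


def mota(frames: Iterable[FrameCounts]) -> float:
    """MOTA = 1 - (misses + false positives + mismatches) / ground-truth count."""
    total_errors = 0
    total_gt = 0
    for misses, false_positives, mismatches, gt_objects in frames:
        total_errors += misses + false_positives + mismatches
        total_gt += gt_objects
    if total_gt == 0:
        raise ValueError("no ground-truth objects in the sequence")
    return 1.0 - total_errors / total_gt


# Two frames with 10 ground-truth objects each and 4 errors in total.
score = mota([(1, 0, 0, 10), (0, 2, 1, 10)])
```

Because the error count is not bounded by the number of ground-truth objects, the value can become negative when the tracker makes more errors than there are objects.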
Structural Similarity Index (SSIM)
The structural similarity index measure (SSIM) measures the perceived similarity of two images. When one image is a modified version of the other (e.g., if it is compressed) the SSIM serves as a measure of the fidelity of the compressed representation. The ...
Objectives:
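The SSIM formula combines the means, variances and covariance of the two images. The sketch below evaluates it once over whole images given as flat lists of pixel intensities, which is a simplification: practical implementations usually compute it over local windows and average the results. The stabilising constants use the commonly cited defaults of 0.01 and 0.03 times the dynamic range, and the function name is hypothetical.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float],
                dynamic_range: float = 255.0) -> float:
    """Single-window SSIM over two equally sized flattened images."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("images must have the same size, with 2+ pixels")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample variances and covariance (n - 1 in the denominator).
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # Small constants that stabilise the division.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


original = [10.0, 20.0, 30.0, 40.0]
distorted = [0.0, 255.0, 0.0, 255.0]
same = global_ssim(original, original)
different = global_ssim(original, distorted)
```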
False Rejection Rate (FRR)
False rejection rate (FRR) is a security metric used to measure the performance of biometric systems such as voice recognition, fingerprint recognition, face recognition, or iris recognition. It represents the likelihood of a biometric system mistakenly rej...
Objectives:
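As an illustration, FRR can be estimated as the share of genuine attempts that a system rejects. The sketch below assumes the system produces a similarity score per attempt and accepts scores at or above a threshold; the score scale and threshold convention are assumptions.

```python
from typing import Sequence


def false_rejection_rate(genuine_scores: Sequence[float],
                         threshold: float) -> float:
    """Share of genuine attempts scoring below the acceptance threshold."""
    if not genuine_scores:
        raise ValueError("at least one genuine attempt is required")
    rejected = sum(1 for score in genuine_scores if score < threshold)
    return rejected / len(genuine_scores)


# One of four genuine attempts falls below the threshold of 0.5.
frr = false_rejection_rate([0.9, 0.4, 0.8, 0.7], threshold=0.5)
```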
False Acceptance Rate (FAR)
False acceptance rate (FAR) is a security metric used to measure the performance of biometric systems such as voice recognition, fingerprint recognition, face recognition, or iris recognition. It represents the likelihood of a biometric system mistakenly ac...
Objectives:
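As an illustration, FAR can be estimated as the share of impostor attempts that a system accepts. The sketch below assumes the system produces a similarity score per attempt and accepts scores at or above a threshold; the score scale and threshold convention are assumptions.

```python
from typing import Sequence


def false_acceptance_rate(impostor_scores: Sequence[float],
                          threshold: float) -> float:
    """Share of impostor attempts scoring at or above the acceptance threshold."""
    if not impostor_scores:
        raise ValueError("at least one impostor attempt is required")
    accepted = sum(1 for score in impostor_scores if score >= threshold)
    return accepted / len(impostor_scores)


# Two of five impostor attempts reach the threshold of 0.5.
far = false_acceptance_rate([0.1, 0.6, 0.3, 0.2, 0.7], threshold=0.5)
```

Raising the threshold generally lowers the rate of false acceptances while increasing the rate of false rejections, so the two are usually reported together.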