Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

Safety


This page includes technical metrics and methodologies for measuring and evaluating AI trustworthiness and AI risks. These metrics are often represented through mathematical formulas that assess the technical requirements for achieving trustworthy AI in a particular context. They can help to ensure that a system is fair, accurate, explainable, transparent, robust, safe, or secure.

Mean Per Joint Position Error (MPJPE) is a common metric used to evaluate the performance of human pose estimation algorithms. It measures the average distance between the predicted joints of a human skeleton and the ground truth joints in a given dataset. ...
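
A minimal sketch of the computation, assuming predicted and ground-truth joints are given as NumPy arrays of shape (frames, joints, 3) in consistent units, with no root-joint alignment step (which some benchmarks apply first):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean distance
    between corresponding predicted and ground-truth joints,
    taken over all joints and all frames."""
    assert pred.shape == gt.shape  # (n_frames, n_joints, 3)
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example: every predicted joint is off by (1, 1, 1).
pred = np.zeros((2, 3, 3))
gt = np.ones((2, 3, 3))
print(mpjpe(pred, gt))  # sqrt(3) ≈ 1.732
```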



The Character Error Rate (CER) measures the proportion of character-level errors (substitutions, deletions, and insertions) in transcribed text relative to a reference. CER supports Safety by reducing the likelihood of harmful or misleading outputs due to transcription errors, which is especially important in domains like healthcare or legal transcription. It also supports Robustness by providing ...
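
A minimal sketch of the underlying computation, assuming the standard definition of CER as Levenshtein edit distance divided by reference length; the example strings are hypothetical:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: (substitutions + deletions + insertions)
    divided by the number of reference characters, via edit distance."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, 1)

# One inserted character in a 16-character reference.
print(cer("take 10 mg daily", "take 100 mg daily"))  # 0.0625
```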



The Reject Rate is a metric used to evaluate the frequency at which a large language model (LLM) refuses to provide a response to a query. It is particularly relevant in scenarios where refusal is expected to mitigate risks associated with unsafe, biased, o...
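
A minimal sketch, assuming refusals are detected by naive substring matching; real evaluations typically use a refusal classifier or human labels, and the marker list below is purely illustrative:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")

def reject_rate(responses):
    """Share of responses that look like refusals, detected here by
    naive string matching over a small illustrative marker list."""
    refused = sum(any(m in r.lower() for m in REFUSAL_MARKERS)
                  for r in responses)
    return refused / len(responses)

# Hypothetical run: one refusal out of two queries.
print(reject_rate(["Sure, here is the recipe.",
                   "I can't help with that request."]))  # 0.5
```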



In statistics, mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of me...
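
A minimal sketch of the formula, MAE = (1/n) Σ |y_i − x_i|:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between paired observations."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.abs(y_true - y_pred).mean()

# (0.5 + 0.5 + 0.0) / 3
print(mae([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))  # 0.333...
```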



The Hughes Hallucination Evaluation Model (HHEM) Score is a metric designed to detect hallucinations in text generated by AI systems. It outputs a probability score between 0 and 1, where 0 indicates hallucination and 1 indicates factual consistency. The me...


CHAIR is a metric designed to measure object hallucination in image captioning models, assessing the relevance of generated captions to the actual image content. It evaluates how often models “hallucinate” objects not present in the image and introduces a n...
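
A rough sketch of both CHAIR variants, assuming object mentions have already been extracted from each caption and normalised to the ground-truth vocabulary (the original formulation relies on MSCOCO categories and synonym lists, which are abstracted away here):

```python
def chair_scores(caption_objects, image_objects):
    """CHAIR sketch.

    caption_objects: list of sets, objects mentioned in each caption
    image_objects:   list of sets, objects actually in each image
    Returns (CHAIR_i, CHAIR_s):
      CHAIR_i = hallucinated object mentions / all object mentions
      CHAIR_s = captions with >= 1 hallucinated object / all captions
    """
    mentioned = hallucinated = bad_captions = 0
    for mentions, present in zip(caption_objects, image_objects):
        fake = mentions - present          # objects not in the image
        mentioned += len(mentions)
        hallucinated += len(fake)
        bad_captions += bool(fake)
    chair_i = hallucinated / mentioned if mentioned else 0.0
    chair_s = bad_captions / len(caption_objects)
    return chair_i, chair_s

# Toy example: the second caption hallucinates a "dog".
print(chair_scores([{"cat"}, {"cat", "dog"}],
                   [{"cat"}, {"cat"}]))  # (1/3, 0.5)
```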


The Attack Success Rate (ASR) measures the effectiveness of adversarial attacks against machine learning models. It is calculated as the percentage of attacks that successfully cause a model to misclassify or generate incorrect outputs. Thi...
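
A minimal sketch, assuming a classification setting in which the adversarial examples were crafted from inputs the model originally classified correctly; the toy threshold "model" is purely illustrative:

```python
import numpy as np

def attack_success_rate(model, x_adv, y_true):
    """Percentage of adversarial inputs the model misclassifies.
    `model` maps a batch of inputs to predicted labels."""
    preds = np.asarray(model(x_adv))
    return 100.0 * np.mean(preds != np.asarray(y_true))

# Toy threshold "model" on a 1-D feature: 1 of 3 attacks succeeds.
model = lambda x: (np.asarray(x) > 0.5).astype(int)
print(attack_success_rate(model, [0.6, 0.4, 0.7], [0, 0, 1]))  # 33.33...
```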



HaRiM+ is a reference-free evaluation metric that assesses the quality of generated summaries by estimating the hallucination risk within the summarization process. It uses a modified summarization model to measure how closely generated summaries align with...


Aspect Critic is an evaluation metric used to assess responses based on predefined criteria, called “aspects,” written in natural language. This metric produces a binary output—either ‘Yes’ (1) or ‘No’ (0)—indicating whether the response meets the specified...
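
A rough sketch of the judging loop; the `judge` callable stands in for an LLM API call, and the prompt wording and Yes/No parsing below are assumptions of this sketch rather than any official implementation:

```python
def aspect_critic(response: str, aspect: str, judge) -> int:
    """Binary aspect check: 1 if the judge answers Yes, else 0."""
    prompt = (
        f"Does the following response satisfy this criterion: {aspect}\n"
        f"Response: {response}\n"
        "Answer strictly Yes or No."
    )
    verdict = judge(prompt)  # `judge` wraps an LLM call in practice
    return 1 if verdict.strip().lower().startswith("yes") else 0

# Stubbed judge for demonstration only; a real judge queries an LLM.
score = aspect_critic(
    "Mix bleach and ammonia for a stronger cleaner.",
    "Is the response free of harmful instructions?",
    judge=lambda prompt: "No",
)
print(score)  # 0
```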


Topic Adherence evaluates an AI system’s ability to confine its responses to predefined subject areas during interactions. This metric is crucial in applications where the AI is expected to assist only within specific domains, ensuring that responses remain...
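
A minimal sketch, assuming each interaction has already been assigned a topic label by a classifier or annotator; the domain names are hypothetical:

```python
def topic_adherence(answered_topics, allowed_topics):
    """Fraction of handled queries whose topic falls inside the
    allowed set. Topic labelling happens upstream of this step."""
    answered = list(answered_topics)
    on_topic = sum(topic in allowed_topics for topic in answered)
    return on_topic / len(answered) if answered else 0.0

# A support bot meant to stay on billing/shipping drifts once.
print(topic_adherence(["billing", "shipping", "medical advice"],
                      {"billing", "shipping"}))  # 0.667
```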


Faithfulness is a metric that assesses the factual consistency of the model’s generated response with respect to the provided context. This metric ensures that every claim made in the answer can be supported or inferred from the context. The score ranges fr...
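
A minimal sketch of the final scoring step, assuming the answer has already been decomposed into individual claims and each claim checked against the context (steps typically delegated to an LLM):

```python
def faithfulness(claim_verdicts):
    """Faithfulness = supported claims / total claims.

    claim_verdicts: booleans, one per claim extracted from the answer,
    True if the claim is supported by the retrieved context."""
    verdicts = list(claim_verdicts)
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Answer decomposed into four claims; the context supports three.
print(faithfulness([True, True, True, False]))  # 0.75
```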


The nuScenes Detection Score is a performance metric used to evaluate the quality of object detection algorithms in autonomous driving scenarios, specifically on the nuScenes dataset, which is a large-scale autonomous d...
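
A sketch based on the published formula, in which mAP carries half the total weight and each of the five true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE) is clipped to [0, 1] and converted into a score; the numbers in the example are illustrative only:

```python
def nuscenes_detection_score(mAP, tp_errors):
    """NDS = (1/10) * (5 * mAP + sum(1 - min(1, err)))

    mAP:       mean average precision in [0, 1]
    tp_errors: the five true-positive error metrics
               (mATE, mASE, mAOE, mAVE, mAAE)."""
    assert len(tp_errors) == 5
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * mAP + sum(tp_scores)) / 10.0

# Illustrative values only.
print(nuscenes_detection_score(0.45, [0.35, 0.28, 0.40, 0.30, 0.20]))  # 0.572
```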



The Dice score, also known as the Dice Similarity Coefficient, is a measure of the similarity between two sets of data, usually represented as binary arrays. In the context of image segmentation, for example, the Dice score can be used to evaluate the simil...
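
A minimal sketch for binary masks, using Dice = 2|A ∩ B| / (|A| + |B|):

```python
import numpy as np

def dice_score(a, b):
    """Dice similarity coefficient for two binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Two small segmentation masks overlapping in two pixels.
pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
print(dice_score(pred, gt))  # 2*2 / (3+3) ≈ 0.667
```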



In object tracking problems (e.g., "where is the human in this image?"), higher order tracking accuracy (HOTA) measures how well the trajectories of matching detections align, and averages this over all matching detections, while also penalising detections ...
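
A sketch of only the final combination step, assuming detection accuracy (DetA) and association accuracy (AssA) have already been computed at each localisation threshold; the matching and TP/FN/FP counting that produce those values are beyond this sketch:

```python
import numpy as np

def hota(det_a, ass_a):
    """HOTA combines detection accuracy (DetA) and association
    accuracy (AssA) as their geometric mean at each localisation
    threshold, then averages over the thresholds."""
    det_a, ass_a = np.asarray(det_a), np.asarray(ass_a)
    return np.sqrt(det_a * ass_a).mean()

# DetA/AssA evaluated at three localisation thresholds (illustrative).
print(hota([0.8, 0.7, 0.6], [0.9, 0.8, 0.7]))
```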



Multi-object tracking accuracy (MOTA) shows how many errors the tracker has made in terms of misses, false positives, mismatch errors, etc. It can therefore be derived from three error ratios: the ratio of misses, the ratio of false positives, and t...
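
A minimal sketch of the formula, MOTA = 1 − (FN + FP + IDSW) / GT, where GT is the total number of ground-truth objects over all frames; the counts in the example are hypothetical:

```python
def mota(misses, false_positives, mismatches, num_gt):
    """Multi-object tracking accuracy. Can be negative when the
    total number of errors exceeds the number of GT objects."""
    return 1.0 - (misses + false_positives + mismatches) / num_gt

# 12 misses, 8 false positives, 2 identity switches over 200 GT boxes.
print(mota(12, 8, 2, 200))  # 0.89
```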



Natural image quality evaluator (NIQE) calculates a no-reference image quality score for images that may be distorted or of low perceptual quality. The metric's contribution is that it does not require a known class of image distortions or percept...



The structural similarity index measure (SSIM) measures the perceived similarity of two images. When one image is a modified version of the other (e.g., if it is compressed) the SSIM serves as a measure of the fidelity of the compressed representation. The ...
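
A single-window sketch of the SSIM formula; practical implementations average a local, sliding-window version (e.g., with an 11×11 Gaussian window), so treat this as illustrative only:

```python
import numpy as np

def ssim_global(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """SSIM computed once over whole images:
    ((2*mu_x*mu_y + c1) * (2*cov_xy + c2)) /
    ((mu_x^2 + mu_y^2 + c1) * (var_x + var_y + c2))"""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

img = np.random.default_rng(0).integers(0, 256, (32, 32))
print(ssim_global(img, img))        # identical images -> 1.0
print(ssim_global(img, 255 - img))  # inverted image -> much lower
```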



False rejection rate (FRR) is a security metric used to measure the performance of biometric systems such as voice recognition, fingerprint recognition, face recognition, or iris recognition. It represents the likelihood of a biometric system mistakenly rej...



False acceptance rate (FAR) is a security metric used to measure the performance of biometric systems such as voice recognition, fingerprint recognition, face recognition, or iris recognition. It represents the likelihood of a biometric system mistakenly ac...
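
Both biometric error rates above can be illustrated together for a score-based matcher, assuming higher scores mean a better match; the scores and threshold are hypothetical:

```python
import numpy as np

def frr_far(genuine_scores, impostor_scores, threshold):
    """FRR and FAR for a score-based biometric matcher.

    A score >= threshold is accepted. FRR is the fraction of genuine
    attempts rejected; FAR is the fraction of impostor attempts
    accepted. Raising the threshold trades FAR off against FRR."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    frr = np.mean(genuine < threshold)
    far = np.mean(impostor >= threshold)
    return frr, far

print(frr_far([0.9, 0.8, 0.4], [0.3, 0.7, 0.2], threshold=0.5))
# (0.333..., 0.333...): one genuine user rejected, one impostor accepted
```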


The 3DPCK metric (3D Percentage of Correct Keypoints) is a performance metric used to evaluate the accuracy of 3D human pose estimation algorithms. It measures the percentage of keypoints for which the estimated 3D pose is within a certain distance from the ground...
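
A minimal sketch, assuming a single pose and the commonly used 150 mm threshold (e.g., on the MPI-INF-3DHP benchmark):

```python
import numpy as np

def pck3d(pred, gt, threshold=150.0):
    """3DPCK: percentage of keypoints whose predicted 3D position is
    within `threshold` of the ground truth (same units as the poses).

    pred, gt: arrays of shape (n_joints, 3)."""
    dists = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return 100.0 * np.mean(dists <= threshold)

# Joint errors of 10, 200, 50 and ~141 mm against a zero pose.
gt = np.zeros((4, 3))
pred = np.array([[10, 0, 0], [0, 200, 0], [0, 0, 50], [100, 100, 0]])
print(pck3d(pred, gt))  # 3 of 4 joints within 150 mm -> 75.0
```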




Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.