Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

Safety


This page includes technical metrics and methodologies for measuring and evaluating AI trustworthiness and AI risks. These metrics are often represented through mathematical formulas that assess the technical requirements for achieving trustworthy AI in a particular context. They can help to ensure that a system is fair, accurate, explainable, transparent, robust, safe, or secure.

Mean Per Joint Position Error (MPJPE) is a common metric used to evaluate the performance of human pose estimation algorithms. It measures the average distance between the predicted joints of a human skeleton and the ground truth joints in a given dataset. ...
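
A minimal sketch of the computation, assuming predicted and ground-truth joints are given as NumPy arrays of shape (frames, joints, 3) in consistent units, with no root-joint alignment step (which some benchmarks apply first):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean distance
    between corresponding predicted and ground-truth joints,
    taken over all joints and all frames."""
    assert pred.shape == gt.shape  # (n_frames, n_joints, 3)
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example: every predicted joint is off by (1, 1, 1).
pred = np.zeros((2, 3, 3))
gt = np.ones((2, 3, 3))
print(mpjpe(pred, gt))  # sqrt(3) ≈ 1.732
```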



The Character Error Rate (CER) measures the proportion of character-level errors (substitutions, deletions, and insertions) in transcribed text relative to a reference. CER supports Safety by reducing the likelihood of harmful or misleading outputs due to transcription errors, which is especially important in domains like healthcare or legal transcription. It also supports Robustness by providing ...
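
A minimal sketch of the underlying computation, assuming the standard definition of CER as Levenshtein edit distance divided by reference length; the example strings are hypothetical:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: (substitutions + deletions + insertions)
    divided by the number of reference characters, via edit distance."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, 1)

# One inserted character in a 16-character reference.
print(cer("take 10 mg daily", "take 100 mg daily"))  # 0.0625
```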



The Reject Rate is a metric used to evaluate the frequency at which a large language model (LLM) refuses to provide a response to a query. It is particularly relevant in scenarios where refusal is expected to mitigate risks associated with unsafe, biased, o...
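
A minimal sketch, assuming refusals are detected by naive substring matching; real evaluations typically use a refusal classifier or human labels, and the marker list below is purely illustrative:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")

def reject_rate(responses):
    """Share of responses that look like refusals, detected here by
    naive string matching over a small illustrative marker list."""
    refused = sum(any(m in r.lower() for m in REFUSAL_MARKERS)
                  for r in responses)
    return refused / len(responses)

# Hypothetical run: one refusal out of two queries.
print(reject_rate(["Sure, here is the recipe.",
                   "I can't help with that request."]))  # 0.5
```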



In statistics, mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of me...
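
A minimal sketch of the formula, MAE = (1/n) Σ |y_i − x_i|:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between paired observations."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.abs(y_true - y_pred).mean()

# (0.5 + 0.5 + 0.0) / 3
print(mae([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))  # 0.333...
```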



The Hughes Hallucination Evaluation Model (HHEM) Score is a metric designed to detect hallucinations in text generated by AI systems. It outputs a probability score between 0 and 1, where 0 indicates hallucination and 1 indicates factual consistency. The me...


CHAIR is a metric designed to measure object hallucination in image captioning models, assessing the relevance of generated captions to the actual image content. It evaluates how often models “hallucinate” objects not present in the image and introduces a n...
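
A rough sketch of both CHAIR variants, assuming object mentions have already been extracted from each caption and normalised to the ground-truth vocabulary (the original formulation relies on MSCOCO categories and synonym lists, which are abstracted away here):

```python
def chair_scores(caption_objects, image_objects):
    """CHAIR sketch.

    caption_objects: list of sets, objects mentioned in each caption
    image_objects:   list of sets, objects actually in each image
    Returns (CHAIR_i, CHAIR_s):
      CHAIR_i = hallucinated object mentions / all object mentions
      CHAIR_s = captions with >= 1 hallucinated object / all captions
    """
    mentioned = hallucinated = bad_captions = 0
    for mentions, present in zip(caption_objects, image_objects):
        fake = mentions - present          # objects not in the image
        mentioned += len(mentions)
        hallucinated += len(fake)
        bad_captions += bool(fake)
    chair_i = hallucinated / mentioned if mentioned else 0.0
    chair_s = bad_captions / len(caption_objects)
    return chair_i, chair_s

# Toy example: the second caption hallucinates a "dog".
print(chair_scores([{"cat"}, {"cat", "dog"}],
                   [{"cat"}, {"cat"}]))  # (1/3, 0.5)
```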


The Attack Success Rate (ASR) measures the effectiveness of adversarial attacks against machine learning models. It is calculated as the percentage of attacks that successfully cause a model to misclassify or generate incorrect outputs. Thi...
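
A minimal sketch, assuming a classification setting in which the adversarial examples were crafted from inputs the model originally classified correctly; the toy threshold "model" is purely illustrative:

```python
import numpy as np

def attack_success_rate(model, x_adv, y_true):
    """Percentage of adversarial inputs the model misclassifies.
    `model` maps a batch of inputs to predicted labels."""
    preds = np.asarray(model(x_adv))
    return 100.0 * np.mean(preds != np.asarray(y_true))

# Toy threshold "model" on a 1-D feature: 1 of 3 attacks succeeds.
model = lambda x: (np.asarray(x) > 0.5).astype(int)
print(attack_success_rate(model, [0.6, 0.4, 0.7], [0, 0, 1]))  # 33.33...
```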



HaRiM+ is a reference-free evaluation metric that assesses the quality of generated summaries by estimating the hallucination risk within the summarization process. It uses a modified summarization model to measure how closely generated summaries align with...


Aspect Critic is an evaluation metric used to assess responses based on predefined criteria, called “aspects,” written in natural language. This metric produces a binary output—either ‘Yes’ (1) or ‘No’ (0)—indicating whether the response meets the specified...
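
A rough sketch of the judging loop; the `judge` callable stands in for an LLM API call, and the prompt wording and Yes/No parsing below are assumptions of this sketch rather than any official implementation:

```python
def aspect_critic(response: str, aspect: str, judge) -> int:
    """Binary aspect check: 1 if the judge answers Yes, else 0."""
    prompt = (
        f"Does the following response satisfy this criterion: {aspect}\n"
        f"Response: {response}\n"
        "Answer strictly Yes or No."
    )
    verdict = judge(prompt)  # `judge` wraps an LLM call in practice
    return 1 if verdict.strip().lower().startswith("yes") else 0

# Stubbed judge for demonstration only; a real judge queries an LLM.
score = aspect_critic(
    "Mix bleach and ammonia for a stronger cleaner.",
    "Is the response free of harmful instructions?",
    judge=lambda prompt: "No",
)
print(score)  # 0
```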


Topic Adherence evaluates an AI system’s ability to confine its responses to predefined subject areas during interactions. This metric is crucial in applications where the AI is expected to assist only within specific domains, ensuring that responses remain...
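
A minimal sketch, assuming each interaction has already been assigned a topic label by a classifier or annotator; the domain names are hypothetical:

```python
def topic_adherence(answered_topics, allowed_topics):
    """Fraction of handled queries whose topic falls inside the
    allowed set. Topic labelling happens upstream of this step."""
    answered = list(answered_topics)
    on_topic = sum(topic in allowed_topics for topic in answered)
    return on_topic / len(answered) if answered else 0.0

# A support bot meant to stay on billing/shipping drifts once.
print(topic_adherence(["billing", "shipping", "medical advice"],
                      {"billing", "shipping"}))  # 0.667
```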


Faithfulness is a metric that assesses the factual consistency of the model’s generated response with respect to the provided context. This metric ensures that every claim made in the answer can be supported or inferred from the context. The score ranges fr...
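
A minimal sketch of the final scoring step, assuming the answer has already been decomposed into individual claims and each claim checked against the context (steps typically delegated to an LLM):

```python
def faithfulness(claim_verdicts):
    """Faithfulness = supported claims / total claims.

    claim_verdicts: booleans, one per claim extracted from the answer,
    True if the claim is supported by the retrieved context."""
    verdicts = list(claim_verdicts)
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Answer decomposed into four claims; the context supports three.
print(faithfulness([True, True, True, False]))  # 0.75
```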


The nuScenes Detection Score is a performance metric used to evaluate the quality of object detection algorithms in autonomous driving scenarios, specifically on the nuScenes dataset, which is a large-scale autonomous d...
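
A sketch based on the published formula, in which mAP carries half the total weight and each of the five true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE) is clipped to [0, 1] and converted into a score; the numbers in the example are illustrative only:

```python
def nuscenes_detection_score(mAP, tp_errors):
    """NDS = (1/10) * (5 * mAP + sum(1 - min(1, err)))

    mAP:       mean average precision in [0, 1]
    tp_errors: the five true-positive error metrics
               (mATE, mASE, mAOE, mAVE, mAAE)."""
    assert len(tp_errors) == 5
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * mAP + sum(tp_scores)) / 10.0

# Illustrative values only.
print(nuscenes_detection_score(0.45, [0.35, 0.28, 0.40, 0.30, 0.20]))  # 0.572
```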



The Dice score, also known as the Dice Similarity Coefficient, is a measure of the similarity between two sets of data, usually represented as binary arrays. In the context of image segmentation, for example, the Dice score can be used to evaluate the simil...
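
A minimal sketch for binary masks, using Dice = 2|A ∩ B| / (|A| + |B|):

```python
import numpy as np

def dice_score(a, b):
    """Dice similarity coefficient for two binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Two small segmentation masks overlapping in two pixels.
pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
print(dice_score(pred, gt))  # 2*2 / (3+3) ≈ 0.667
```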



In object tracking problems (e.g., "where is the human in this image?"), higher order tracking accuracy (HOTA) measures how well the trajectories of matching detections align, and averages this over all matching detections, while also penalising detections ...
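
A sketch of only the final combination step, assuming detection accuracy (DetA) and association accuracy (AssA) have already been computed at each localisation threshold; the matching and TP/FN/FP counting that produce those values are beyond this sketch:

```python
import numpy as np

def hota(det_a, ass_a):
    """HOTA combines detection accuracy (DetA) and association
    accuracy (AssA) as their geometric mean at each localisation
    threshold, then averages over the thresholds."""
    det_a, ass_a = np.asarray(det_a), np.asarray(ass_a)
    return np.sqrt(det_a * ass_a).mean()

# DetA/AssA evaluated at three localisation thresholds (illustrative).
print(hota([0.8, 0.7, 0.6], [0.9, 0.8, 0.7]))
```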



Multi-object tracking accuracy (MOTA) shows how many errors the tracker has made in terms of misses, false positives, mismatch errors, etc. It can therefore be derived from three error ratios: the ratio of misses, the ratio of false positives, and t...
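
A minimal sketch of the formula, MOTA = 1 − (FN + FP + IDSW) / GT, where GT is the total number of ground-truth objects over all frames; the counts in the example are hypothetical:

```python
def mota(misses, false_positives, mismatches, num_gt):
    """Multi-object tracking accuracy. Can be negative when the
    total number of errors exceeds the number of GT objects."""
    return 1.0 - (misses + false_positives + mismatches) / num_gt

# 12 misses, 8 false positives, 2 identity switches over 200 GT boxes.
print(mota(12, 8, 2, 200))  # 0.89
```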



Natural image quality evaluator (NIQE) calculates a no-reference image quality score for images that may be distorted or of low perceptual quality. The metric's contribution is that it does not require a known class of image distortions or percept...



The structural similarity index measure (SSIM) measures the perceived similarity of two images. When one image is a modified version of the other (e.g., if it is compressed) the SSIM serves as a measure of the fidelity of the compressed representation. The ...
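
A single-window sketch of the SSIM formula; practical implementations average a local, sliding-window version (e.g., with an 11×11 Gaussian window), so treat this as illustrative only:

```python
import numpy as np

def ssim_global(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """SSIM computed once over whole images:
    ((2*mu_x*mu_y + c1) * (2*cov_xy + c2)) /
    ((mu_x^2 + mu_y^2 + c1) * (var_x + var_y + c2))"""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

img = np.random.default_rng(0).integers(0, 256, (32, 32))
print(ssim_global(img, img))        # identical images -> 1.0
print(ssim_global(img, 255 - img))  # inverted image -> much lower
```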



False rejection rate (FRR) is a security metric used to measure the performance of biometric systems such as voice recognition, fingerprint recognition, face recognition, or iris recognition. It represents the likelihood of a biometric system mistakenly rej...



False acceptance rate (FAR) is a security metric used to measure the performance of biometric systems such as voice recognition, fingerprint recognition, face recognition, or iris recognition. It represents the likelihood of a biometric system mistakenly ac...
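
Both biometric error rates above can be illustrated together for a score-based matcher, assuming higher scores mean a better match; the scores and threshold are hypothetical:

```python
import numpy as np

def frr_far(genuine_scores, impostor_scores, threshold):
    """FRR and FAR for a score-based biometric matcher.

    A score >= threshold is accepted. FRR is the fraction of genuine
    attempts rejected; FAR is the fraction of impostor attempts
    accepted. Raising the threshold trades FAR off against FRR."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    frr = np.mean(genuine < threshold)
    far = np.mean(impostor >= threshold)
    return frr, far

print(frr_far([0.9, 0.8, 0.4], [0.3, 0.7, 0.2], threshold=0.5))
# (0.333..., 0.333...): one genuine user rejected, one impostor accepted
```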


The 3DPCK metric (3D Percentage of Correct Keypoints) is a performance metric used to evaluate the accuracy of 3D human pose estimation algorithms. It measures the percentage of keypoints for which the estimated 3D pose is within a certain distance from the ground...
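
A minimal sketch, assuming a single pose and the commonly used 150 mm threshold (e.g., on the MPI-INF-3DHP benchmark):

```python
import numpy as np

def pck3d(pred, gt, threshold=150.0):
    """3DPCK: percentage of keypoints whose predicted 3D position is
    within `threshold` of the ground truth (same units as the poses).

    pred, gt: arrays of shape (n_joints, 3)."""
    dists = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return 100.0 * np.mean(dists <= threshold)

# Joint errors of 10, 200, 50 and ~141 mm against a zero pose.
gt = np.zeros((4, 3))
pred = np.array([[10, 0, 0], [0, 200, 0], [0, 0, 50], [100, 100, 0]])
print(pck3d(pred, gt))  # 3 of 4 joints within 150 mm -> 75.0
```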




Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.