Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

Objective: Robustness

International · Uploaded on Jun 4, 2026
Amazon Nova Premier is a multimodal foundation model that was evaluated under Amazon’s Frontier Model Safety Framework to assess and mitigate risks related to Chemical, Biological, Radiological, and Nuclear (CBRN) weapons proliferation, offensive cyber operations, and automated AI research and development.

Technical · Norway · Uploaded on Jun 12, 2026
Lightweight AI safety auditing framework for red-teaming AI systems through adversarial probing. Supports multilingual testing across safety, healthcare, and RAG scenarios, and works with cloud APIs or fully local models.

Related lifecycle stage(s)

Deploy, Verify & validate

Technical · Uploaded on Jun 3, 2026
ShieldGemma is a set of instruction-tuned models for evaluating the safety of text and images against a set of defined safety policies.

Related lifecycle stage(s)

Operate & monitor, Deploy

Technical · Uploaded on Jun 3, 2026
FACTS Grounding is a comprehensive benchmark for evaluating the ability of LLMs to generate responses that are not only factually accurate with respect to given inputs, but also sufficiently detailed to provide satisfactory answers to user queries.

Related lifecycle stage(s)

Verify & validate, Build & interpret model

Technical · Uploaded on Jun 3, 2026
Eureka is a reusable and open evaluation framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings.

Technical · Uploaded on Jun 3, 2026
The MLCommons AILuminate benchmark evaluates an AI system-under-test (SUT) by inputting a set of prompts, recording the SUT’s responses, and then using a specialized set of “safety evaluator models” to determine which of the responses are violations according to the AILuminate Assessment Standard guidelines. Findings are summarized in a human-readable report.

Related lifecycle stage(s)

Verify & validate

Technical · Uploaded on Mar 20, 2026
garak is an open-source LLM vulnerability scanner developed by NVIDIA that probes large language models for security weaknesses including prompt injection, jailbreaks, hallucination, toxicity, data leakage, and misinformation.

Technical · Procedural · Uploaded on Jun 3, 2026
AuditNLG is an open-source toolkit for auditing the trustworthiness of generative AI text. It evaluates outputs across three key dimensions: factualness (consistency with knowledge), safety (harmful or biased content), and constraint adherence (compliance with instructions). The tool aggregates multiple state-of-the-art methods and provides scores, explanations, and improved text suggestions via self-refinement prompts. It supports both API-based and local models, enabling flexible integration into evaluation pipelines and governance frameworks.

Technical · Uploaded on Mar 20, 2026
OpenEnv is a framework for evaluating AI agents against real systems rather than simulations. It provides a standardised way to connect agents to real tools and workflows while preserving the structure needed for consistent and reliable evaluation.

Related lifecycle stage(s)

Operate & monitor, Deploy, Verify & validate

Technical · Procedural · Uploaded on Mar 20, 2026
The Approved Intelligence Platform (AIP) provides modular, scenario-based testing workflows to evaluate mission-critical AI systems in defence, public safety, and critical civil use cases. It delivers a comprehensive, end-to-end testing environment based on a proprietary AI trust ontology with measurable AI Solutions Quality Indicators (ASQI) for the testing, evaluation, validation and verification of software solutions with different AI modalities.

Related lifecycle stage(s)

Operate & monitor, Deploy, Verify & validate

Technical · Procedural · Uploaded on Mar 20, 2026
The Resaro AI Solutions Quality Index (ASQI) provides a transparent, use-case-specific measure of AI quality for applications such as customer chat services, object recognition, deepfake detection, or x-ray anomaly identification.

Technical · Educational · Uploaded on Aug 27, 2025
AI Screener to enable universal early screening for all children.

Related lifecycle stage(s)

Plan & design

Procedural · Uploaded on Aug 1, 2025
BeSpecial is an AI-driven platform designed to support university students with dyslexia by providing personalized digital tools and tailored learning strategies. Developed within the European VRAILEXIA project, BeSpecial combines clinical data, self-assessments, and psychometric tests to recommend customized resources like audiobooks and concept maps, as well as inclusive academic practices. The platform also raises awareness and trains educators to foster inclusive higher education environments.

Related lifecycle stage(s)

Operate & monitor, Deploy

Australia · Uploaded on May 22, 2025
FloodMapp is a technology company that specialises in rapid real-time flood forecasting and flood inundation mapping to provide greater warning time and situational awareness.

Technical · Educational · Mexico · United States · Israel · Uploaded on May 19, 2025
SeismicAI provides Earthquake Early Warning (EEW) systems. Its algorithms use local sensors to issue high-precision alerts for earthquake preparedness. The system covers the full early-warning cycle, from monitoring and reporting, through alerts, to optionally triggering automated preventive actions.

Related lifecycle stage(s)

Operate & monitor

Technical · Europe · Uploaded on May 19, 2025
The AIFS is the first fully operational open weather-prediction model based on machine learning.

Technical · United States · Uploaded on May 15, 2025
The GDA leverages aerial imagery, satellite data, and machine learning techniques to evaluate the damage in areas impacted by natural disasters. This tool greatly enhances the efficiency and precision of disaster response operations.

Technical · United States · Uploaded on May 2, 2025
ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is a globally accessible, living knowledge base of adversary tactics and techniques against AI-enabled systems based on real-world attack observations and realistic demonstrations from AI red teams and security groups.

Procedural · Canada · Uploaded on Mar 31, 2025
This program provides organisations with a comprehensive, independent review of their AI approaches, ensuring alignment with consensus standards and enhancing trust among stakeholders and the public in their AI practices.

Related lifecycle stage(s)

Verify & validate

Technical · United States · Uploaded on Jan 8, 2025
MLPerf Client is a benchmark for Windows and macOS that focuses on client form factors in ML inference scenarios such as AI chatbots and image classification. It evaluates performance across different hardware and software configurations and provides a command-line interface.

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.