Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

Objective: Safety

Technical, Procedural | Colombia | Uploaded on Jun 9, 2026
A web application that lets organizations assess their maturity in artificial intelligence governance and automatically generates a customized roadmap for meeting national and international standards.

Technical | United States | Uploaded on Jun 9, 2026
A cloud platform for evaluating AI system performance on private data and presenting the results.

Related lifecycle stage(s)

Verify & validate

International | Uploaded on Jun 4, 2026
Amazon Nova Premier is a multimodal foundation model that was evaluated under Amazon’s Frontier Model Safety Framework to assess and mitigate risks related to Chemical, Biological, Radiological, and Nuclear (CBRN) weapons proliferation, offensive cyber operations, and automated AI research and development.

Technical | France | Uploaded on Jun 9, 2026
GovLLM is an open-source runtime governance framework for LLM systems. It treats regulatory compliance as a continuous signal derived from production observability rather than a static audit verdict. A panel of small language model judges evaluates each response against governance profiles.

Technical | International | Uploaded on Jun 3, 2026
The AI red team service exposes hidden safety and security threats across the entire lifecycle of artificial intelligence (AI) systems by applying an adversarial mindset to assess AI systems during design, development, deployment, and operations stages.

Technical | International | Uploaded on Jun 3, 2026
FlowMS is an AI-powered utility efficiency tool built on AWS that analyses metering data to detect anomalies in water use and help conserve water in Amazon buildings.

Technical | Uploaded on Jun 3, 2026
LLM Vulnerability Scanner and Guardrails provides a comprehensive assessment of LLM vulnerabilities and automatically applies optimal defensive techniques to LLM-based generative AI.

Related lifecycle stage(s)

Deploy, Verify & validate

Technical | Uploaded on Jun 3, 2026
The Agentic Benchmark for CRM is a benchmarking framework developed by Salesforce to evaluate the performance of AI agents and models in enterprise customer relationship management (CRM) use cases using metrics such as accuracy, cost, speed, trust and safety, and sustainability.

Technical | Uploaded on Jun 3, 2026
The Moderation endpoint allows developers to classify text and image inputs to determine whether they may violate OpenAI’s safety policies.

Related lifecycle stage(s)

Operate & monitor, Deploy

Uploaded on Jun 3, 2026
The Google Responsible Generative AI Toolkit provides tools and guidance to design, build and evaluate open AI models responsibly.

Technical | Uploaded on Jun 3, 2026
ShieldGemma is a set of instruction-tuned models for evaluating the safety of text and images against a set of defined safety policies.

Related lifecycle stage(s)

Operate & monitor, Deploy

Technical | Uploaded on Jun 3, 2026
FACTS Grounding is a comprehensive benchmark for evaluating the ability of LLMs to generate responses that are not only factually accurate with respect to given inputs, but also sufficiently detailed to provide satisfactory answers to user queries.

Related lifecycle stage(s)

Verify & validate, Build & interpret model

Educational | Uploaded on Jun 3, 2026
The OWASP Top 10 for Large Language Model (LLM) Applications is a written guidance document developed by the OWASP community to identify the most critical security risks affecting applications that use large language models.

Related lifecycle stage(s)

Operate & monitor, Plan & design

Technical | United States | Uploaded on May 18, 2026
VERA-MH (Validation of Ethical and Responsible AI in Mental Health) is a comprehensive framework for evaluating AI chatbots in a mental health context.

Technical | Uploaded on Jun 3, 2026
The MLCommons AILuminate benchmark evaluates an AI system-under-test (SUT) by inputting a set of prompts, recording the SUT’s responses, and then using a specialized set of “safety evaluator models” to determine which of the responses are violations according to the AILuminate Assessment Standard guidelines. Findings are summarized in a human-readable report.

Related lifecycle stage(s)

Verify & validate
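
The AILuminate description above outlines a prompt, response, evaluate, report loop. As a rough illustration only, that loop might be sketched as follows. Every name here (`run_safety_benchmark`, `format_report`, `toy_sut`, `toy_evaluator`) is hypothetical and is not part of the MLCommons tooling; the real benchmark uses trained evaluator models rather than a string check.

```python
from typing import Callable, Iterable


def run_safety_benchmark(
    prompts: Iterable[str],
    sut: Callable[[str], str],
    is_violation: Callable[[str, str], bool],
) -> dict:
    """Send each prompt to the SUT, record the response, and flag violations."""
    records = []
    for prompt in prompts:
        response = sut(prompt)
        records.append(
            {
                "prompt": prompt,
                "response": response,
                "violation": is_violation(prompt, response),
            }
        )
    violations = sum(1 for record in records if record["violation"])
    return {"total": len(records), "violations": violations, "records": records}


def format_report(summary: dict) -> str:
    """Produce a one-line human-readable summary of the findings."""
    return f"{summary['violations']} of {summary['total']} responses flagged as violations"


# Toy stand-ins (assumptions) for a real system-under-test and evaluator model.
def toy_sut(prompt: str) -> str:
    return "UNSAFE" if "attack" in prompt else "I can help with that."


def toy_evaluator(prompt: str, response: str) -> bool:
    return response == "UNSAFE"


summary = run_safety_benchmark(
    ["how to bake bread", "plan an attack"], toy_sut, toy_evaluator
)
print(format_report(summary))  # prints "1 of 2 responses flagged as violations"
```

Passing the SUT and the evaluator in as callables keeps the loop independent of any particular model or provider, which is one plausible way to structure such a harness.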

Technical | Uploaded on Mar 20, 2026
garak is an open-source LLM vulnerability scanner developed by NVIDIA that probes large language models for security weaknesses including prompt injection, jailbreaks, hallucination, toxicity, data leakage, and misinformation.

Technical, Procedural | Uploaded on Jun 3, 2026
AuditNLG is an open-source toolkit for auditing the trustworthiness of generative AI text. It evaluates outputs across three key dimensions: factualness (consistency with knowledge), safety (harmful or biased content), and constraint adherence (compliance with instructions). The tool aggregates multiple state-of-the-art methods and provides scores, explanations, and improved text suggestions via self-refinement prompts. It supports both API-based and local models, enabling flexible integration into evaluation pipelines and governance frameworks.

Technical | Uploaded on Mar 20, 2026
OpenEnv is a framework for evaluating AI agents against real systems rather than simulations. It provides a standardised way to connect agents to real tools and workflows while preserving the structure needed for consistent and reliable evaluation.

Related lifecycle stage(s)

Operate & monitor, Deploy, Verify & validate

Procedural | Uploaded on Feb 16, 2026
The AI Inherent Risk Scale (AIIRS) is a task-based classification instrument that helps organisations assess the inherent risk of generative AI use. It evaluates tasks against three criteria—epistemic dependence, verifiability, and consequences of error—and assigns a LOW, MEDIUM, or HIGH risk rating using a max-dominant model. AIIRS supports proportionate safeguards, accountable oversight, and governance-aligned decision-making without determining whether AI use is permitted in a given context.

Related lifecycle stage(s)

Operate & monitor
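
The AIIRS entry above mentions a max-dominant model over three criteria. One reading of that phrase, assumed here and not confirmed by the catalogue entry, is that the overall rating equals the highest rating given to any single criterion. A minimal sketch under that assumption, with hypothetical names (`Risk`, `overall_rating`):

```python
from enum import IntEnum


class Risk(IntEnum):
    """Ordered risk levels; a higher value means higher risk."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def overall_rating(
    epistemic_dependence: Risk,
    verifiability: Risk,
    consequences_of_error: Risk,
) -> Risk:
    """Return the highest per-criterion rating (assumed max-dominant rule)."""
    return max(epistemic_dependence, verifiability, consequences_of_error)


# A task that is low risk on two criteria but high on one is rated HIGH overall.
rating = overall_rating(Risk.LOW, Risk.MEDIUM, Risk.HIGH)
print(rating.name)  # prints "HIGH"
```

Under this reading, a single high-risk criterion cannot be averaged away by low ratings on the others.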

Procedural | Uploaded on Jan 15, 2026
WasItAI is an image checker designed to detect AI-generated photos.

Partnership on AI

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.