Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

Overview Tools Metrics About the catalogue

BELLS - Benchmarks for the Evaluation of LLM Safeguards

Github

The BELLS project from CeSIA aims to evaluate safeguards for large language models (LLMs) that detect undesired behaviors in model inputs and outputs. These safeguards serve as a testbed for scalable oversight of LLMs. BELLS currently focuses on evaluating two types of safeguards: jailbreak detectors and hallucination detectors.

The evaluations assess safeguard performance along two axes:

Performance on established datasets of jailbreaks and hallucinations. This tests how well safeguards detect known failure modes.
Generalization - specifically for jailbreaks - to new types published after the safeguard's release. This serves as a proxy for the tool's ability to catch future, unknown failure types.

Additionally, BELLS provides a leaderboard ranking the most effective safeguards. This guides users towards selecting the tools that offer the strongest protection against LLM harms.

Overall, the goal is to rigorously evaluate input-output safeguards, promoting wider adoption of oversight techniques that make LLMs more reliable and safe.

About the tool

You can click on the links to see the associated tools

Tool type(s):

Toolkit/software
Other

Objective(s):

Robustness
Safety
Transparency

Impacted stakeholders:

Consumers
Management

Purpose(s):

Event/anomaly detection
Content generation

Country of origin:

France

Lifecycle stage(s):

Operate & monitor
Verify & validate

Type of approach:

Technical

Maturity:

In development

Target groups:

Private sector
Public sector
Technical community

Stakeholder group:

Academia
Other

Benefits:

Increased quality results

Geographical scope:

All countries & regions

Tags:

evaluation
large language model
safety
safeguards

Modify this tool

Use Cases

There is no use cases for this tool yet.

Would you like to submit a use case for this tool?

If you have used this tool, we would love to know more about your experience.

Add use case

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.