Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

BELLS - Benchmarks for the Evaluation of LLM Safeguards



BELLS - Benchmarks for the Evaluation of LLM Safeguards

The BELLS project from CeSIA aims to evaluate safeguards for large language models (LLMs) that detect undesired behaviors in model inputs and outputs. These safeguards serve as a testbed for scalable oversight of LLMs. BELLS currently focuses on evaluating two types of safeguards: jailbreak detectors and hallucination detectors.

The evaluations assess safeguard performance along two axes:

  • Performance on established datasets of jailbreaks and hallucinations. This tests how well safeguards detect known failure modes.
  • Generalization - specifically for jailbreaks - to new types published after the safeguard's release. This serves as a proxy for the tool's ability to catch future, unknown failure types.

Additionally, BELLS provides a leaderboard ranking the most effective safeguards. This guides users towards selecting the tools that offer the strongest protection against LLM harms.

Overall, the goal is to rigorously evaluate input-output safeguards, promoting wider adoption of oversight techniques that make LLMs more reliable and safe.

About the tool




Impacted stakeholders:



Country of origin:



Type of approach:




Stakeholder group:



Geographical scope:


Tags:

  • evaluation
  • large language model
  • safety
  • safeguards

Modify this tool

Use Cases

There is no use cases for this tool yet.

Would you like to submit a use case for this tool?

If you have used this tool, we would love to know more about your experience.

Add use case
catalogue Logos

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.