These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.
BELLS - Benchmarks for the Evaluation of LLM Safeguards
The BELLS project from CeSIA aims to evaluate safeguards for large language models (LLMs) that detect undesired behaviors in model inputs and outputs. These safeguards serve as a testbed for scalable oversight of LLMs. BELLS currently focuses on evaluating two types of safeguards: jailbreak detectors and hallucination detectors.
The evaluations assess safeguard performance along two axes:
- Performance on established datasets of jailbreaks and hallucinations. This tests how well safeguards detect known failure modes.
- Generalization - specifically for jailbreaks - to new types published after the safeguard's release. This serves as a proxy for the tool's ability to catch future, unknown failure types.
Additionally, BELLS provides a leaderboard ranking the most effective safeguards. This guides users towards selecting the tools that offer the strongest protection against LLM harms.
Overall, the goal is to rigorously evaluate input-output safeguards, promoting wider adoption of oversight techniques that make LLMs more reliable and safe.
About the tool
You can click on the links to see the associated tools
Tool type(s):
Objective(s):
Impacted stakeholders:
Purpose(s):
Country of origin:
Lifecycle stage(s):
Type of approach:
Maturity:
Target groups:
Benefits:
Geographical scope:
Tags:
- evaluation
- large language model
- safety
- safeguards
Use Cases
Would you like to submit a use case for this tool?
If you have used this tool, we would love to know more about your experience.
Add use case