These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.
FACTS Grounding: A new benchmark for evaluating the factuality of large language models
The FACTS Grounding tool is a benchmark for evaluating the ability of LLMs to generate responses that are not only factually accurate with respect to given inputs, but also sufficiently detailed to provide satisfactory answers to user queries. It aims to address the problem of hallucinations, where AI systems produce incorrect or unsupported information. The benchmark uses a dataset of 1,719 examples, each including a document, system instructions, and a user request that must be answered using only the provided text. The dataset covers multiple domains such as finance, technology, medicine, retail, and law, and includes tasks like summarization, question answering, and rewriting.
Model responses are evaluated by three LLM judges: Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. Using judges from different model families mitigates the potential bias of a judge giving higher scores to responses produced by a member of its own model family. The evaluation occurs in two steps: first checking whether the response properly addresses the user request, and then verifying that the response is fully supported by the source document without hallucinations.
Scores from the judge models are aggregated to determine overall performance. The results are published on the FACTS leaderboard hosted on Kaggle, allowing comparison of model performance. The benchmark includes both public and private evaluation sets to reduce risks of benchmark contamination. Overall, FACTS Grounding helps track progress in improving the factual reliability and trustworthiness of AI systems.
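The two-step evaluation and the aggregation across judges can be illustrated with a minimal sketch. The description above does not specify the exact aggregation formula, so the code assumes that a judge accepts a response only if it passes both steps, and that scores are simple averages across judges and then across examples. The names `JudgeVerdict`, `response_score` and `benchmark_score` are hypothetical and are not part of any published FACTS Grounding code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class JudgeVerdict:
    """One judge's verdict on one response (hypothetical structure)."""
    judge: str
    addresses_request: bool  # step 1: does the response answer the user request?
    grounded: bool           # step 2: is the response fully supported by the document?


def response_score(verdicts: list[JudgeVerdict]) -> float:
    """Fraction of judges that accept a single response.

    Assumption: a judge accepts a response only if it passes both steps.
    """
    return mean(
        1.0 if v.addresses_request and v.grounded else 0.0 for v in verdicts
    )


def benchmark_score(all_verdicts: list[list[JudgeVerdict]]) -> float:
    """Average of per-response scores over all examples (assumed aggregation)."""
    return mean(response_score(verdicts) for verdicts in all_verdicts)


# Two illustrative responses, each judged by the three judge models.
response_a = [
    JudgeVerdict("Gemini 1.5 Pro", addresses_request=True, grounded=True),
    JudgeVerdict("GPT-4o", addresses_request=True, grounded=True),
    JudgeVerdict("Claude 3.5 Sonnet", addresses_request=True, grounded=True),
]
response_b = [
    JudgeVerdict("Gemini 1.5 Pro", addresses_request=True, grounded=True),
    JudgeVerdict("GPT-4o", addresses_request=True, grounded=False),   # unsupported claim
    JudgeVerdict("Claude 3.5 Sonnet", addresses_request=False, grounded=True),  # off-topic
]

overall = benchmark_score([response_a, response_b])
```

In this sketch a response that is well grounded but fails to address the request is rejected in the same way as a hallucinated one, which mirrors the requirement that answers be both supported by the document and sufficiently responsive to the user.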
About the tool
Tags:
- ai quality
- Accuracy and performance
- ai
- large language models
- ai evaluation