Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

FACTS Grounding: A new benchmark for evaluating the factuality of large language models



FACTS Grounding is a benchmark for evaluating whether LLMs can generate responses that are both factually accurate with respect to a given input document and detailed enough to answer the user's query satisfactorily. It targets hallucinations, where AI systems produce incorrect or unsupported information. The benchmark uses a dataset of 1,719 examples, each consisting of a document, system instructions, and a user request that must be answered using only the provided text. The dataset spans domains such as finance, technology, medicine, retail, and law, and includes tasks such as summarization, question answering, and rewriting.

Model responses are evaluated by three LLM judges: Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. Using judges from different model families mitigates the potential bias of a judge giving higher scores to responses produced by a member of its own family. The evaluation occurs in two steps: the judge first checks whether the response properly addresses the user request, and then verifies that the response is fully supported by the source document, without hallucinations.
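The two-step check can be sketched in Python as follows. This is a minimal illustration, not the benchmark's actual code: the `JudgeVerdict` type and the `passes` function are hypothetical names, and treating a response that fails the first step as failing overall is an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    """One judge's verdict on one model response (hypothetical structure)."""
    judge: str                # name of the judge model
    addresses_request: bool   # step 1: does the response address the user request?
    fully_grounded: bool      # step 2: is the response fully supported by the document?


def passes(verdict: JudgeVerdict) -> bool:
    # Assumption for this sketch: a response counts as acceptable
    # only if it clears both steps.
    return verdict.addresses_request and verdict.fully_grounded
```

Under this assumption, a response that is well supported by the document but does not address the request would still be counted as a failure.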

Scores from the judge models are aggregated to determine overall performance. The results are published on the FACTS leaderboard hosted on Kaggle, allowing comparison of model performance. The benchmark includes both public and private evaluation sets to reduce risks of benchmark contamination. Overall, FACTS Grounding helps track progress in improving the factual reliability and trustworthiness of AI systems.
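One plausible way to aggregate judge scores is sketched below in Python. The text above does not specify the aggregation rule, so the choice of an unweighted mean of per-judge pass rates is an assumption, and the function names and example data are hypothetical.

```python
from statistics import mean


def judge_score(passed: list[bool]) -> float:
    """Fraction of responses that a single judge accepted."""
    if not passed:
        raise ValueError("a judge must have scored at least one response")
    return sum(passed) / len(passed)


def aggregate(per_judge: dict[str, list[bool]]) -> float:
    """Assumed rule: unweighted mean of the per-judge pass rates."""
    if not per_judge:
        raise ValueError("at least one judge is required")
    return mean(judge_score(verdicts) for verdicts in per_judge.values())


# Made-up verdicts for four responses, for illustration only.
example = {
    "judge_a": [True, True, False, True],    # pass rate 0.75
    "judge_b": [True, False, False, True],   # pass rate 0.50
    "judge_c": [True, True, True, True],     # pass rate 1.00
}
overall = aggregate(example)  # 0.75 for this made-up data
```

Averaging across judges from different model families means that no single judge's preferences determine the final score.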

About the tool


Tags:

  • ai quality
  • Accuracy and performance
  • ai
  • large language models
  • ai evaluation

Use Cases

There are no use cases for this tool yet.


Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.