Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

FACTS Grounding: A new benchmark for evaluating the factuality of large language models

FACTS Grounding is a benchmark for evaluating whether LLMs generate responses that are both factually accurate with respect to the given inputs and detailed enough to answer the user's query satisfactorily. It targets hallucinations, where AI systems produce incorrect or unsupported information. The benchmark comprises 1,719 examples, each consisting of a document, system instructions, and a user request that must be answered using only the provided text. The dataset spans domains such as finance, technology, medicine, retail, and law, and includes tasks such as summarization, question answering, and rewriting.
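To make the shape of one benchmark example concrete, the following is a minimal sketch of how its three parts might be held together and turned into a prompt. The class name, field names, prompt layout, and sample text are all illustrative assumptions; they are not the dataset's actual schema or content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundingExample:
    """One benchmark item. Field names are hypothetical, not the dataset schema."""
    system_instruction: str
    context_document: str
    user_request: str


def build_prompt(example: GroundingExample) -> str:
    """Join the three parts into one prompt string for the model under test.

    The layout (labels and blank-line separators) is an assumption made
    for illustration only.
    """
    return "\n\n".join([
        example.system_instruction,
        "Document:\n" + example.context_document,
        "Request:\n" + example.user_request,
    ])


# Made-up sample text, not taken from the benchmark.
example = GroundingExample(
    system_instruction="Answer using only the document provided.",
    context_document="The warranty covers parts for 24 months.",
    user_request="How long does the warranty cover parts?",
)
prompt = build_prompt(example)
print(prompt)
```

Keeping the document, instructions, and request as separate fields makes it easy to hand the same document to a judge later when checking whether a response is supported by it.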

Model responses are evaluated by three LLM judges (Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet) to mitigate the bias of a judge giving higher scores to responses produced by a member of its own model family. Evaluation takes place in two steps: the judges first check whether the response adequately addresses the user request, and then verify that the response is fully supported by the source document, with no hallucinations.
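The two-step check can be sketched as follows. This is a toy illustration only: in the benchmark each step is carried out by an LLM judge, whereas the two stand-in judges here are simple word-matching heuristics invented for this example, and all names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable

# A judge takes (user_request, document, response) and returns True or False.
Judge = Callable[[str, str, str], bool]


@dataclass(frozen=True)
class Verdict:
    eligible: bool  # step 1: does the response address the user request?
    grounded: bool  # step 2: is the response supported by the document?

    @property
    def passed(self) -> bool:
        # A response counts only if it clears both steps.
        return self.eligible and self.grounded


def evaluate(request: str, document: str, response: str,
             eligibility_judge: Judge, grounding_judge: Judge) -> Verdict:
    """Run both checks on one response and combine them."""
    return Verdict(
        eligible=eligibility_judge(request, document, response),
        grounded=grounding_judge(request, document, response),
    )


def _words(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def toy_eligibility_judge(request: str, document: str, response: str) -> bool:
    # Stand-in heuristic (assumption): treat very short replies as unhelpful.
    return len(_words(response)) >= 3


def toy_grounding_judge(request: str, document: str, response: str) -> bool:
    # Stand-in heuristic (assumption): every word must occur in the document.
    allowed = set(_words(document))
    return all(word in allowed for word in _words(response))


document = "The warranty covers parts for 24 months."
request = "How long does the warranty cover parts?"

supported = evaluate(request, document, "The warranty covers parts for 24 months.",
                     toy_eligibility_judge, toy_grounding_judge)
unsupported = evaluate(request, document, "The warranty covers parts for 36 months.",
                       toy_eligibility_judge, toy_grounding_judge)
too_short = evaluate(request, document, "24 months.",
                     toy_eligibility_judge, toy_grounding_judge)
print(supported.passed, unsupported.passed, too_short.passed)
```

Separating the two checks means a response that is accurate but unhelpfully short still fails, which matches the benchmark's requirement that answers be both grounded and sufficiently detailed.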

Scores from the judge models are aggregated to determine overall performance. The results are published on the FACTS leaderboard hosted on Kaggle, allowing comparison of model performance. The benchmark includes both public and private evaluation sets to reduce risks of benchmark contamination. Overall, FACTS Grounding helps track progress in improving the factual reliability and trustworthiness of AI systems.
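As a rough illustration of aggregating judge scores, the sketch below computes each judge's pass rate and then takes the unweighted mean across judges. The text above says only that scores are aggregated, so the simple mean is an assumption, and the judge names and verdicts are made-up placeholders rather than real results.

```python
from statistics import mean


def judge_score(verdicts: list[bool]) -> float:
    """Fraction of responses that one judge marked as passing."""
    return sum(verdicts) / len(verdicts)


def aggregate(verdicts_by_judge: dict[str, list[bool]]) -> float:
    """Unweighted mean of per-judge pass rates (assumed aggregation rule)."""
    return mean(judge_score(v) for v in verdicts_by_judge.values())


# Placeholder verdicts for four responses from one model; not real data.
verdicts_by_judge = {
    "judge_a": [True, True, False, True],
    "judge_b": [True, False, False, True],
    "judge_c": [True, True, True, True],
}
per_judge = {name: judge_score(v) for name, v in verdicts_by_judge.items()}
overall = aggregate(verdicts_by_judge)
print(per_judge, overall)
```

Averaging across judges from different model families keeps any single judge's preferences from dominating the overall score.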

About the tool

Developing organisation(s):

Google DeepMind

Tool type(s):

Technical validation
Rating framework

Objective(s):

Robustness
Safety

Lifecycle stage(s):

Verify & validate
Build & interpret model

Type of approach:

Technical

Maturity:

Published document

Usage rights:

Free of charge

Target groups:

Technical community

Target users:

Developer

Stakeholder group:

Technical community

Benefits:

Increased quality results
Reduction in risk of failure

Geographical scope:

International

Risk management stage(s):

Govern: Monitor and review risks & impacts
Assess risks & impacts

Tags:

ai quality
Accuracy and performance
ai
large language models
ai evaluation

Use Cases

There are no use cases for this tool yet.

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.