SacreBLEU

Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official Workshop on Machine Translation (WMT) scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization.

Trustworthy AI Relevance

This metric addresses Robustness and Transparency by quantifying relevant system properties. Robustness: SacreBLEU quantifies translation quality in a repeatable way, so it can be used to track how consistently a model performs across datasets, domains, and noisy or out-of-distribution (OOD) inputs. A drop or instability in SacreBLEU scores across these conditions indicates reduced reliability and resilience. This makes the metric useful for assessing robustness (consistency and reliability), even though it measures only surface-form overlap.
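One possible way to act on the robustness reading above is to compare BLEU scores computed under different conditions against a baseline and flag large drops. The sketch below is only an illustration: the function name `flag_bleu_drops`, the 2.0-point threshold, and all score values are assumptions, not figures from the catalogue or from SacreBLEU itself.

```python
def flag_bleu_drops(scores, baseline, max_drop=2.0):
    """Return conditions whose BLEU is more than `max_drop` points below the baseline.

    `scores` maps a condition name to a corpus-level BLEU score (0-100).
    The threshold is an arbitrary illustrative choice.
    """
    base = scores[baseline]
    return {
        name: round(base - value, 2)
        for name, value in scores.items()
        if name != baseline and base - value > max_drop
    }


# Hypothetical scores for one model evaluated under four conditions.
scores = {
    "in_domain": 34.2,
    "news": 33.1,
    "noisy_input": 27.5,
    "out_of_domain": 22.8,
}

drops = flag_bleu_drops(scores, baseline="in_domain")
for condition, drop in drops.items():
    print(f"{condition}: {drop} BLEU below baseline")
```

Such a comparison is only meaningful when every score is computed with the same SacreBLEU settings, which is the reproducibility property the metric is designed to provide.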

Related use cases:

SLTEV: Comprehensive Evaluation of Spoken Language Translation

Uploaded on Nov 1, 2022

Automatic evaluation of Machine Translation (MT) quality has been investigated over several decades. Spoken Language Translation (SLT), esp. when simultaneous, needs to conside...

CAiRE in DialDoc21: Data Augmentation for Information-Seeking Dialogue System

Uploaded on Nov 1, 2022

Information-seeking dialogue systems, including knowledge identification and response generation, aim to respond to users with fluent, coherent, and informative responses based...

About the metric

Objective(s):

Robustness
Transparency

Purpose(s):

Forecasting/prediction
Recognition/object detection

Lifecycle stage(s):

Operate & monitor
Verify & validate

Target users:

Data scientist
Developer
Project manager
System operators

Risk management stage(s):

Define
Assess
Treat

Partnership on AI

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.