Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

The BLEU score has some undesirable properties when used for single sentences, as it was designed to be a corpus measure. The Google BLEU score, also known as GLEU score, is designed to limit these undesirable properties when used for single sentences.

To calculate this score, all sub-sequences of 1, 2, 3 or 4 tokens in output and target sequence (n-grams) are recorded. The precision and recall, described below, are then computed.

  • precision: the ratio of the number of matching n-grams to the number of total n-grams in the generated output sequence
  • recall: the ratio of the number of matching n-grams to the number of total n-grams in the target (ground truth) sequence

The minimum value of precision and recall is then returned as the score.

About the metric


Objective(s):



Lifecycle stage(s):



Risk management stage(s):

Modify this metric

catalogue Logos

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.