These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.
Bilingual Evaluation Understudy (BLEU) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: “the closer a machine translation is to a professional human translation, the better it is” – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
Scores are calculated for individual translated segments (generally sentences) by comparing them with a set of good-quality reference translations. Those scores (strictly, their underlying n-gram match counts) are then combined over the whole corpus to estimate the translation's overall quality. Neither intelligibility nor grammatical correctness is taken into account.
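The following is a minimal Python sketch of corpus-level BLEU in its standard formulation. It makes the segment comparison and corpus aggregation concrete. Clipped (modified) n-gram precisions p_1 to p_4 are pooled over all segments and combined as a geometric mean with uniform weights. The result is multiplied by a brevity penalty BP = exp(1 - r/c) when the total candidate length c does not exceed the effective reference length r, and BP = 1 otherwise. The sketch assumes pre-tokenized input, uniform weights, no smoothing, and one common convention for the effective reference length (the closest reference length, with ties going to the shorter reference). The names `ngrams` and `corpus_bleu` are illustrative and not a library API. Reported scores are normally produced with an established implementation such as sacreBLEU, which also standardizes tokenization.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[List[str]],
                references: List[List[List[str]]],
                max_n: int = 4) -> float:
    """Illustrative corpus BLEU: uniform weights 1/max_n, no smoothing.

    hypotheses[i] is a tokenized candidate segment; references[i] is the
    (non-empty) list of tokenized reference translations for that segment.
    """
    matches = [0] * max_n  # clipped n-gram matches, pooled over the corpus
    totals = [0] * max_n   # candidate n-gram counts, pooled over the corpus
    hyp_len = 0            # c: total candidate length
    ref_len = 0            # r: total effective reference length

    for hyp, refs in zip(hypotheses, references):
        hyp_len += len(hyp)
        # Effective reference length for this segment: the reference closest
        # in length to the candidate, ties broken toward the shorter one.
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            # Highest count of each n-gram in any single reference
            # (Counter union keeps the maximum count per key).
            max_ref_counts = Counter()
            for r in refs:
                max_ref_counts |= ngrams(r, n)
            # Clip each candidate n-gram count by its maximum reference count.
            matches[n - 1] += sum(min(c, max_ref_counts[g])
                                  for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any zero n-gram precision makes the score zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the pooled precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

For example, with `max_n=1`, the candidate "the the the the the the the" is scored against the references "the cat is on the mat" and "there is a cat on the mat". Its clipped unigram precision is 2/7 rather than 7/7, because "the" occurs at most twice in any single reference. The second reference has the same length (7 tokens), so there is no brevity penalty, and the score is 2/7 ≈ 0.286. With the default `max_n=4` the same candidate scores 0, since no bigram matches. This is one reason unsmoothed BLEU is usually reported at corpus level rather than per sentence.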
Related use cases:
Better than Average: Paired Evaluation of NLP Systems
Uploaded on Oct 21, 2022
Evaluation in NLP is usually done by comparing the scores of competing systems independently averaged over a common set of test instances. In this work, we question the use of ...
COMET: A Neural Framework for MT Evaluation
Uploaded on Oct 21, 2022
We present COMET, a neural framework for training multilingual machine translation evaluation models which obtains new state-of-the-art levels of correlation with human judgeme...
Semantic-Based Self-Critical Training For Question Generation
Uploaded on Nov 1, 2022
Question generation is a conditioned language generation task that consists in generating a context-aware question given a context and the targeted answer. Train language model...
BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations
Uploaded on Nov 1, 2023
Controlling Hallucinations at Word Level in Data-to-Text Generation
Uploaded on Nov 1, 2023
ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation
Uploaded on Nov 1, 2023
Improving Sign Language Translation with Monolingual Data by Sign Back-Translation
Uploaded on Nov 1, 2023
Machine Translation Pre-training for Data-to-Text Generation -- A Case Study in Czech
Uploaded on Nov 1, 2023
May the Force Be with Your Copy Mechanism: Enhanced Supervised-Copy Method for Natural Language Generation
Uploaded on Nov 1, 2023
Multimodal Pretraining for Dense Video Captioning
Uploaded on Nov 1, 2023
NITS-VC System for VATEX Video Captioning Challenge 2020
Uploaded on Nov 1, 2023
NeurST: Neural Speech Translation Toolkit
Uploaded on Nov 1, 2023
RefineCap: Concept-Aware Refinement for Image Captioning
Uploaded on Nov 1, 2023
Scaling Up Vision-Language Pre-training for Image Captioning
Uploaded on Nov 1, 2023

About the metric
Objective(s):
Target sector(s):
Lifecycle stage(s):
Target users: