BERTScore

Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

BERTScore is an automatic evaluation metric for text generation. It computes a similarity score between each token in the candidate sentence and each token in the reference sentence, using pre-trained contextual embeddings from BERT models and matching tokens across the two sentences by cosine similarity. From these matches, BERTScore derives precision, recall, and F1 scores, which makes it applicable to a range of language generation tasks.
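The token-matching step described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the 2-dimensional toy vectors stand in for real BERT contextual embeddings, the function names `cosine` and `greedy_match_scores` are hypothetical, and optional refinements such as idf weighting or baseline rescaling are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from greedy token matching.

    Precision averages, over candidate tokens, the best similarity
    to any reference token. Recall does the same over reference tokens.
    """
    # sim[i][j] = similarity of candidate token i and reference token j
    sim = [[cosine(c, r) for r in reference_vecs] for c in candidate_vecs]

    precision = sum(max(row) for row in sim) / len(candidate_vecs)
    recall = sum(
        max(sim[i][j] for i in range(len(candidate_vecs)))
        for j in range(len(reference_vecs))
    ) / len(reference_vecs)

    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy "embeddings" (assumed values, one vector per token).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f = greedy_match_scores(candidate, reference)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

In this toy case every candidate token has an exact counterpart in the reference, so precision is perfect, while the third reference token is only partially covered, which lowers recall. In practice the vectors would come from a pre-trained BERT model rather than being written by hand.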

BERTScore can indirectly support Explainability by offering a quantitative assessment of how closely a model's output matches a reference, which can help users understand model performance. However, it does not explain individual decisions or model behavior, which limits its impact on this objective.

Trustworthy AI Relevance

This metric addresses Transparency and Robustness by quantifying relevant system properties. BERTScore contributes to Transparency by providing an interpretable metric that helps developers and stakeholders understand how well an AI system's text outputs align semantically with expected results, improving openness about system performance. It supports Robustness by enabling evaluation of AI models under various conditions and inputs, which helps detect when models fail to produce semantically coherent outputs and so supports the reliability and resilience of NLP systems.

Related use cases:

Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation

Uploaded on Nov 1, 2022

Neural image-to-text radiology report generation systems offer the potential to improve radiology reporting by reducing the repetitive process of report drafting and identifyin...

About the metric

Objective(s):

Robustness

Purpose(s):

Recognition/object detection

Target sector(s):

Health

Lifecycle stage(s):

Build & interpret model

Target users:

Developer

Risk management stage(s):

Assess

Partnership on AI

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.