Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

432 citations of this metric

GPTScore is a framework for evaluating the quality of text generated by large language models (LLMs). It uses the built-in capabilities of these models, like zero-shot learning and in-context learning, to provide flexible, training-free assessments tailored to user needs.


Applicable Models


This metric is designed for large language models.


Background


Existing methods to evaluate generated text often focus on narrow criteria, like fluency or relevance, and depend on labeled data or fine-tuned models. GPTScore simplifies this process by using LLMs’ ability to follow instructions and learn from examples without requiring additional training. This approach allows for more versatile and accessible text evaluation.


Formulae


$$\mathrm{GPTScore}(h \mid d, a, S) = \sum_{t=1}^{m} \log p\big(h_t \mid h_{<t},\, T(d, a, S),\, \theta\big)$$

Where:

• h = generated text of m tokens, with h_t the t-th token and h_{<t} the tokens preceding it

• d = task description

• a = evaluation aspect (e.g., fluency, relevance)

• S = context (e.g., source or reference text)

• T(·) = prompt that defines the evaluation process

• θ = model parameters
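To make the formula concrete, below is a minimal Python sketch that computes GPTScore as the sum of per-token log-probabilities of a hypothesis under an aspect-conditioned evaluation prompt. The choice of GPT-2 via the Hugging Face transformers library and the prompt wording are illustrative assumptions; the paper evaluates with a range of larger instruction-following LLMs.

```python
# Minimal sketch of GPTScore: sum of log p(h_t | h_<t, T(d, a, S), theta)
# over the tokens of the hypothesis h. GPT-2 is used purely for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def gpt_score(prompt: str, hypothesis: str) -> float:
    """Sum of log-probabilities of the hypothesis tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    hyp_ids = tokenizer(hypothesis, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, hyp_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits          # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    start = prompt_ids.shape[1]
    targets = input_ids[0, start:]                # the hypothesis tokens h_t
    # Logits at position t-1 predict the token at position t.
    preds = log_probs[0, start - 1:-1, :]
    return preds.gather(1, targets.unsqueeze(1)).sum().item()

# Hypothetical evaluation prompt T(d, a, S) for aspect a = fluency:
prompt = ("Generate a fluent summary for the following text: "
          "The cat sat on the mat. Summary:")
print(gpt_score(prompt, " A cat sat on a mat."))
```

Higher (less negative) scores indicate text the evaluator model finds more probable under the aspect-conditioned prompt. In practice the score is often normalized by hypothesis length (an average rather than a sum) so that shorter texts are not trivially favored.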


Applications

• Evaluating summaries for readability and accuracy.

• Measuring relevance and factual accuracy in data-to-text outputs.

• Assessing dialogue responses for engagement and coherence.

• Comparing translations for correctness and fluency.
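Each of these aspects is operationalized by a different evaluation prompt T(d, a, S). A hedged sketch of how such templates might look for summarization (the wordings below are illustrative assumptions, not the paper's exact prompts):

```python
# Illustrative aspect-specific prompt templates T(d, a, S) for summarization.
# The wordings are assumptions for demonstration only.
ASPECT_TEMPLATES = {
    "fluency": ("Generate a fluent and grammatical summary for the "
                "following text: {src} Summary:"),
    "relevance": ("Generate a summary relevant to the following text: "
                  "{src} Summary:"),
    "factuality": ("Generate a factually consistent summary for the "
                   "following text: {src} Summary:"),
}

def build_prompt(aspect: str, source_text: str) -> str:
    """Instantiate T(d, a, S) for the chosen aspect and source text."""
    return ASPECT_TEMPLATES[aspect].format(src=source_text)

# The same hypothesis can then be scored under different aspects, e.g.:
# gpt_score(build_prompt("fluency", source_text), hypothesis)
```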


Impact


GPTScore simplifies the evaluation of text generated by large language models. It removes the need for labeled data and model fine-tuning, saving time and resources. By supporting evaluation along multiple dimensions, it brings automated assessments closer to human judgments and helps inform the design of generative AI systems.

Trustworthy AI Relevance

This metric addresses Robustness and Transparency by quantifying relevant system properties. Robustness: GPTScore measures output quality and consistency across examples and models; by quantifying correctness/coherence/preference it helps detect performance degradation, distribution shifts, or brittle outputs (e.g., inconsistent answers, hallucinations). As a comparative metric, it supports reliability assessments and regression testing, which are central to Robustness.
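As an illustration of the regression-testing use mentioned above, the sketch below compares mean GPTScore between two system versions on a small test set. Here generate_old and generate_new are hypothetical stand-ins for the systems under test, and gpt_score and build_prompt are the sketches defined earlier; the tolerance threshold is an assumption to be tuned per task.

```python
# Hedged sketch: GPTScore as a regression test between two system versions.
# generate_old / generate_new are hypothetical placeholders, not a real API.
sources = ["The cat sat on the mat.", "GPTScore evaluates generated text."]

def generate_old(src: str) -> str:   # placeholder: current system
    return " A short summary of: " + src

def generate_new(src: str) -> str:   # placeholder: candidate replacement
    return " Summary: " + src

def mean_score(generate, aspect: str = "fluency") -> float:
    """Average GPTScore of a system's outputs over the test set."""
    scores = [gpt_score(build_prompt(aspect, s), generate(s)) for s in sources]
    return sum(scores) / len(scores)

TOLERANCE = 1.0  # assumed threshold; tune for the task at hand
if mean_score(generate_new) < mean_score(generate_old) - TOLERANCE:
    print("Possible fluency regression in the new version")
```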

AI Validation Analysis

Connection to Trustworthy AI Objectives:

Robustness: GPTScore provides quantitative, repeatable measures of output quality across facets (e.g., coherence, usefulness, creativity). These measures can be used to track performance consistency, detect regressions under distribution shift, and compare models under varying conditions, all of which are core to assessing the robustness of generative systems.

Transparency: GPTScore, especially when broken into named sub-scores (creativity, factuality, relevance) or accompanied by evaluator rationales, helps stakeholders understand how model outputs are being judged and which aspects drive differences between systems. That visibility improves the interpretability of model evaluation and supports clearer communication about system behavior.

Practical value and caveats: GPTScore is valuable for benchmarking, model selection, and continuous monitoring, making it a useful tool in trustworthy-AI workflows. However, because GPTScore relies on LLM evaluators, it can introduce evaluator bias, calibration issues, and potential lack of reproducibility across evaluator versions or prompts. It should therefore be used in combination with objective reference-based metrics, human evaluation, and safety-specific checks to form a comprehensive evaluation strategy.

Validation Score: 4/5

References

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2024. GPTScore: Evaluate as You Desire. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6556–6576, Mexico City, Mexico. Association for Computational Linguistics.

About the metric



GitHub stars: 235

GitHub forks: 18



Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.