Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

432 citations of this metric

GPTScore is a framework for evaluating the quality of text generated by large language models (LLMs). It uses the built-in capabilities of these models, such as zero-shot learning and in-context learning, to provide flexible, training-free assessments tailored to user needs.

Applicable Models

This metric is designed for large language models.

Background

Existing methods for evaluating generated text often focus on narrow criteria, such as fluency or relevance, and depend on labeled data or fine-tuned models. GPTScore simplifies this process by using LLMs’ ability to follow instructions and learn from examples without requiring additional training. This approach allows for more versatile and accessible text evaluation.

Formulae

GPTScore(h | d, a, S) = Σ_{t=1}^{m} w_t · log p(h_t | h_{<t}, T(d, a, S), θ)

Where:

• h = generated text of m tokens; h_t is its t-th token and h_{<t} the tokens before it

• d = task description

• a = evaluation aspect (e.g., fluency, relevance)

• S = context (e.g., source or reference text)

• T(·) = prompt template that defines the evaluation protocol

• w_t = weight of the t-th token (uniform in practice)

• θ = model parameters
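
The formula sums token-level log-probabilities of the generated text under the scoring LLM, conditioned on the evaluation prompt. A minimal sketch with uniform weights (w_t = 1), using a toy stand-in for the model — the real metric would query an LLM for token log-probabilities, and the function names and prompt tokens here are illustrative, not part of the original method:

```python
import math

def gpt_score(log_prob, prompt_tokens, hyp_tokens):
    """GPTScore(h | d, a, S): sum over t of log p(h_t | h_<t, T(d, a, S)).

    `log_prob(context, token)` stands in for the scoring LLM: it must
    return log p(token | context) under that model.
    """
    total = 0.0
    context = list(prompt_tokens)          # T(d, a, S), already tokenized
    for tok in hyp_tokens:                 # h_1 .. h_m
        total += log_prob(context, tok)    # log p(h_t | h_<t, ...)
        context.append(tok)                # extend h_<t for the next step
    return total

# Toy stand-in model: uniform over a 4-word vocabulary, so every token
# contributes log(1/4) regardless of context.
VOCAB = ["the", "cat", "sat", "down"]

def uniform_lm(context, token):
    return math.log(1.0 / len(VOCAB))

prompt = ["Generate", "a", "fluent", "summary", ":"]  # role of T(d, a, S)
hypothesis = ["the", "cat", "sat"]
score = gpt_score(uniform_lm, prompt, hypothesis)     # 3 * log(1/4)
```

A higher (less negative) score means the scoring model finds the text a more likely continuation of the aspect-specific prompt.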

Applications

• Evaluating summaries for readability and accuracy.

• Measuring relevance and factual accuracy in data-to-text outputs.

• Assessing dialogue responses for engagement and coherence.

• Comparing translations for correctness and fluency.
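
Each of these applications pairs a task description d and an aspect a with its own prompt template T(d, a, S). A hedged sketch of how such templates might be assembled — the exact wording below is an assumption for illustration, not the paper's templates:

```python
# Illustrative aspect-specific prompt templates T(d, a, S) for a
# summarization task; the phrasing is hypothetical.
TEMPLATES = {
    "fluency": "Generate a fluent and grammatical summary of the "
               "following text: {src} Summary: ",
    "relevance": "Generate a summary that is relevant to the "
                 "following text: {src} Summary: ",
}

def build_prompt(aspect: str, source: str) -> str:
    """Instantiate T(d, a, S) for one aspect and one source text."""
    return TEMPLATES[aspect].format(src=source)

prompt = build_prompt("fluency", "AI systems should be transparent.")
# The generated text h is then scored by its log-likelihood of
# continuing this prompt under the scoring LLM.
```

Swapping the template changes the aspect being measured without any retraining, which is what makes the evaluation "as you desire".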

Impact

GPTScore makes it easier to evaluate text outputs from large language models. It removes the need for labeled data and model fine-tuning, saving time and resources. By supporting evaluation along multiple user-defined dimensions, it helps align assessments with human expectations and informs the design of generative AI systems.

References

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2024. GPTScore: Evaluate as You Desire. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6556–6576, Mexico City, Mexico. Association for Computational Linguistics.

About the metric


Github stars: 235

Github forks: 18

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.