GPTScore is a framework for evaluating the quality of text generated by large language models (LLMs). It uses the built-in capabilities of these models, like zero-shot learning and in-context learning, to provide flexible, training-free assessments tailored to user needs.
Applicable Models
This metric is designed for large language models.
Background
Existing methods to evaluate generated text often focus on narrow criteria, like fluency or relevance, and depend on labeled data or fine-tuned models. GPTScore simplifies this process by using LLMs’ ability to follow instructions and learn from examples without requiring additional training. This approach allows for more versatile and accessible text evaluation.
Formulae
GPTScore(h | d, a, S) = Σ_{t=1}^{m} w_t · log p(h_t | h_{<t}, T(d, a, S), θ)
Where:
• h = generated text of m tokens, with h_t its t-th token and h_{<t} the tokens before it
• d = task description
• a = evaluation aspect (e.g., fluency, relevance)
• S = context (e.g., source or reference text)
• T(·) = prompt template that defines the evaluation protocol
• w_t = weight of the t-th token (typically uniform)
• θ = model parameters
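The formula reduces to a simple aggregation once the per-token log-probabilities are available. The sketch below assumes those log-probabilities have already been obtained from an evaluator LLM after conditioning on the prompt T(d, a, S); with uniform weights w_t = 1/m, the weighted sum becomes the mean log-probability. This is an illustrative minimal implementation, not the authors' code.

```python
def gpt_score(token_logprobs, weights=None):
    """Compute GPTScore(h | d, a, S) as a weighted sum of
    log p(h_t | h_<t, T(d, a, S), theta).

    token_logprobs: per-token log-probabilities of the generated text h,
        as reported by the evaluator LLM when scoring h after the prompt.
    weights: optional per-token weights w_t; defaults to uniform 1/m,
        making the score the mean log-probability (higher is better).
    """
    m = len(token_logprobs)
    if weights is None:
        weights = [1.0 / m] * m
    return sum(w * lp for w, lp in zip(weights, token_logprobs))


# Illustrative log-probabilities only (not real model output):
score = gpt_score([-0.2, -0.5, -0.1, -0.3])  # mean log-prob = -0.275
```

Because the score is a (mean) log-probability, it is most useful for comparing candidate texts against each other under the same prompt rather than as an absolute quality number.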
Applications
• Evaluating summaries for readability and accuracy.
• Measuring relevance and factual accuracy in data-to-text outputs.
• Assessing dialogue responses for engagement and coherence.
• Comparing translations for correctness and fluency.
Impact
GPTScore makes it easier to evaluate text outputs from large language models. It removes the need for labeled data and model fine-tuning, saving time and resources. By supporting multi-dimensional evaluation, it aligns assessments more closely with human judgments and helps improve the design of generative AI systems.
References
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2024. GPTScore: Evaluate as You Desire. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6556–6576, Mexico City, Mexico. Association for Computational Linguistics.
About the metric
GitHub stars: 235
GitHub forks: 18
