Prometheus is an open-source evaluator language model, fine-tuned on feedback data to perform fine-grained evaluations of AI-generated text. Designed for transparency, cost-efficiency, and reproducibility, it matches proprietary models like GPT-4 in evaluation accuracy when provided with detailed reference materials. Prometheus 2 extends its predecessor's capabilities by supporting both direct assessment and pairwise ranking formats, demonstrating strong performance across benchmarks and closing the gap with proprietary models.
Applicable Models:
Prometheus and Prometheus 2 are designed to evaluate the outputs of large language models (LLMs) in both general-purpose and domain-specific applications. They work effectively with responses from models such as GPT-3.5, GPT-4, Claude, Llama-2, and Mistral.
Background:
The use of proprietary models for evaluation has raised concerns around transparency, cost, and reproducibility. Prometheus addresses these concerns by training on the Feedback Collection dataset and focusing on fine-grained, user-defined evaluation criteria. Prometheus 2 introduces the Preference Collection dataset, enabling unified evaluation across multiple formats.
Formulae:
Prometheus relies on supervised fine-tuning on datasets that provide two evaluation formats:
• Direct Assessment: a scalar quality score on a 1-5 scale for a single response.
• Pairwise Ranking: a binary judgment of which of two candidate responses is better.
Prometheus 2 employs weight merging to combine models trained separately on each format into a single unified evaluator, as sketched below.
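Prometheus 2 arrives at its unified evaluator by linearly interpolating the weights of the two single-format models. Below is a minimal sketch of that idea, assuming Hugging Face Transformers; the checkpoint names and the merge coefficient alpha are illustrative assumptions, not the released recipe.

    from transformers import AutoModelForCausalLM

    def merge_state_dicts(sd_a, sd_b, alpha=0.5):
        # Element-wise linear interpolation: alpha * A + (1 - alpha) * B.
        # Assumes both checkpoints share one architecture and all tensors are float.
        return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

    # Hypothetical checkpoint names for the two single-format evaluators.
    direct = AutoModelForCausalLM.from_pretrained("direct-assessment-evaluator")
    pairwise = AutoModelForCausalLM.from_pretrained("pairwise-ranking-evaluator")

    merged = merge_state_dicts(direct.state_dict(), pairwise.state_dict(), alpha=0.5)
    direct.load_state_dict(merged)  # reuse one model object to hold the merged weights
    direct.save_pretrained("unified-evaluator")

Equal weighting (alpha = 0.5) is shown only as a default; the balance between the two formats is a tunable design choice.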
Applications:
• Evaluating AI-generated responses against user-defined criteria such as helpfulness, accuracy, and creativity (see the sketch after this list).
• Supporting domain-specific evaluations, such as coding, scientific explanations, or cultural sensitivity.
• Acting as a reward model for reinforcement learning from human feedback (RLHF) pipelines.
• Helping developers and researchers build precise and transparent evaluation frameworks.
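In practice, direct assessment works by prompting the evaluator with an instruction, the response to grade, and a score rubric, then parsing a 1-5 score from its output. A minimal sketch, assuming the publicly released prometheus-eval/prometheus-7b-v2.0 checkpoint and a simplified version of the papers' "[RESULT]" output convention:

    import re
    from transformers import pipeline

    # Assumed checkpoint id; any Prometheus-family evaluator would fit here.
    evaluator = pipeline("text-generation", model="prometheus-eval/prometheus-7b-v2.0")

    # Simplified prompt skeleton: instruction, response, and user-defined rubric.
    prompt = """###Task Description:
    Evaluate the response against the rubric, write feedback, then output "[RESULT]" followed by an integer from 1 to 5.

    ###Instruction: Explain why the sky is blue.
    ###Response: Sunlight scatters off air molecules, and shorter blue wavelengths scatter the most (Rayleigh scattering).
    ###Score Rubric: Is the explanation scientifically accurate and easy to follow?
    """

    output = evaluator(prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]
    match = re.search(r"\[RESULT\]\s*([1-5])", output)
    score = int(match.group(1)) if match else None
    print(output, score)

Pairwise ranking follows the same pattern, with two candidate responses in the prompt and a verdict naming the better one parsed in place of the numeric score.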
Impact:
Prometheus models democratize LLM evaluation by removing dependence on proprietary systems, promoting fairness and accessibility. They enhance reproducibility, reduce costs, and offer flexibility for diverse evaluation needs. Prometheus 2 extends these benefits by unifying evaluation formats in a single model, supporting deeper insight into AI system performance.
References
Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., ... & Seo, M. (2024). Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations.
Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., ... & Seo, M. (2024). Prometheus 2: An open source language model specialized in evaluating other language models. arXiv preprint arXiv:2405.01535.
About the metric
GitHub stars: 293
GitHub forks: 17
