The Hughes Hallucination Evaluation Model (HHEM) Score is a metric for detecting hallucinations in AI-generated text. It outputs a probability between 0 and 1, where 0 indicates hallucination and 1 indicates factual consistency with the source text. The metric is particularly suited to Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems. Vectara recommends treating scores of 0.5 or above as factually consistent.
Applicable Models
• Large Language Models (LLMs)
• Retrieval-Augmented Generation (RAG) systems
• Summarization models
• Natural Language Inference (NLI) models
Background
HHEM is built on Microsoft’s DeBERTa-v3-base model, first trained on Natural Language Inference (NLI) data and then fine-tuned on text summarization datasets. It is optimized for detecting factual inconsistencies between a source text and generated output, and has become a widely used tool for hallucination detection in generative AI.
Formulae
HHEM computes scores with a pretrained cross-encoder. The model takes an input pair (ground truth, inference), i.e. the source text and the generated text, and outputs the probability that the inference is factually consistent with the ground truth.
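Operationally, the score can be read as the model’s estimated probability that the generated text is supported by the source. Below is a minimal scoring sketch in Python, assuming the sentence-transformers CrossEncoder interface documented for earlier HHEM revisions (newer revisions on the model card use a custom Transformers loading interface instead); the example texts are invented:

from sentence_transformers import CrossEncoder

# Load the cross-encoder from the Hugging Face Hub.
model = CrossEncoder("vectara/hallucination_evaluation_model")

# Each input is a (ground truth, inference) pair.
score = model.predict([
    ["The capital of France is Paris.",        # source text
     "Paris is the capital city of France."],  # generated text
])[0]

# Vectara's recommended decision rule: scores >= 0.5 are factually consistent.
label = "consistent" if score >= 0.5 else "hallucination"
print(f"HHEM score: {score:.3f} -> {label}")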
Applications
• Hallucination detection in RAG pipelines (see the sketch after this list)
• Evaluation of factual consistency in summarization models
• Accuracy enhancement in enterprise LLM deployments
• Real-time inference scoring for generative AI systems
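As a sketch of the RAG use case above, the HHEM score can gate generated answers before they reach the user. The retrieve and generate functions, the threshold, and the fallback message below are hypothetical placeholders, not part of HHEM itself:

from sentence_transformers import CrossEncoder

hhem = CrossEncoder("vectara/hallucination_evaluation_model")

def guarded_answer(question, retrieve, generate, threshold=0.5):
    """Return the generated answer only if HHEM judges it consistent
    with the retrieved context; otherwise fall back."""
    context = retrieve(question)          # hypothetical retriever
    answer = generate(question, context)  # hypothetical generator
    score = hhem.predict([[context, answer]])[0]
    if score >= threshold:
        return answer
    return "I could not find a well-grounded answer."  # fallback reply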
Impact
Compared with LLM-as-a-judge approaches, HHEM enables low-latency, cost-effective hallucination detection. Because its output is a calibrated probability, users can tune the decision threshold to the needs of a specific application. With multilingual support and efficient computation, HHEM contributes to the reliability and trustworthiness of generative AI systems.
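For example, an application-specific threshold can be chosen by sweeping candidate cutoffs over a labeled validation set; the scores and labels below are invented for illustration:

import numpy as np
from sklearn.metrics import f1_score

# Hypothetical validation data: HHEM scores and human consistency labels.
scores = np.array([0.91, 0.12, 0.55, 0.78, 0.33, 0.64])
labels = np.array([1, 0, 0, 1, 0, 1])  # 1 = consistent, 0 = hallucination

# Sweep thresholds and keep the one with the best F1 for "consistent".
thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds, key=lambda t: f1_score(labels, scores >= t))
print(f"Best threshold: {best:.2f}")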
References
Vectara. “hallucination_evaluation_model (Revision 7437011).” Hugging Face, 2024. DOI: 10.57967/hf/3240. https://huggingface.co/vectara/hallucination_evaluation_model.
About the metric
GitHub stars: 1,300
GitHub forks: 51
