Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

19 citations of this metric

The Reject Rate is a metric used to evaluate the frequency at which a large language model (LLM) refuses to provide a response to a query. It is particularly relevant in scenarios where refusal is expected to mitigate risks associated with unsafe, biased, or non-ethical outputs. This metric helps measure the robustness and safety of an LLM by analyzing its ability to appropriately reject or decline unsafe requests.
 

Applicable Models:
Reject Rate is commonly applied to large language models, particularly those designed for conversational AI or content generation tasks.
 

Background:
With the increasing deployment of LLMs across diverse sectors, ensuring they do not generate harmful or inappropriate content has become a priority. Reject Rate evaluates refusal mechanisms in response to potentially unsafe, adversarial, or ambiguous inputs. Research such as SORRY-Bench and OR-Bench (e.g., Xie et al., 2024; Cui et al., 2024) underscores the importance of balancing refusal behavior with performance.
 

Formulae:
The Reject Rate is computed as:

Reject Rate = (Number of Rejected Queries) / (Total Number of Queries)

Where:

  • Rejected Queries: Instances where the model explicitly refuses to provide an output.
  • Total Queries: All inputs provided to the model during evaluation.
     

Applications:

  • Robustness Assessment: Evaluating how well a model handles adversarial or ambiguous inputs.
  • Safety Checks: Ensuring the model avoids generating harmful or biased content.
  • System Tuning: Fine-tuning refusal mechanisms in AI systems for compliance with ethical and regulatory guidelines.
     

Impact:
The Reject Rate is critical for maintaining the safety and reliability of AI systems. A well-calibrated reject rate ensures models refuse unsafe queries while minimizing the over-refusal of benign ones, which could degrade usability and user experience. It also provides transparency to developers and users about the system's safety mechanisms.

Trustworthy AI Relevance

This metric addresses Robustness and Human Agency & Control by quantifying relevant system properties. Robustness: RR directly relates to robustness because selective rejection/abstention is a common mechanism to handle uncertainty, OOD inputs, and adversarial or noisy conditions. A well-calibrated reject rate indicates the model’s ability to withhold predictions under adverse or unfamiliar conditions, improving reliability under distribution shift.

## AI Validation Analysis **Connection to Trustworthy AI Objectives:** Robustness: RR quantifies the system's abstention behavior and thus its ability to detect uncertainty, out-of-distribution inputs, or conditions under which the model should not make an automated decision. Monitoring RR (and its relationship with error rates on non-rejected items) helps evaluate and tune resilience to distribution shift, noisy inputs, and adversarial or ambiguous cases. Human Agency & Control: RR measures how often the system defers decisions to a human (or other fallback). It operationalizes human-in-the-loop safeguards and autonomy by making explicit the frequency and conditions under which human oversight is invoked, enabling policies that preserve user control and enable appropriate escalation. Practical value: RR is actionable (set thresholds, trade off coverage vs. safety), easy to compute, and valuable for deployment decisions. Caveats: RR must be interpreted with context (desired coverage, cost of human review), broken down by subgroup and input type to avoid unfair or opaque deferral patterns, and paired with complementary metrics (accuracy on accepted items, harm rates) to fully assess trustworthiness. **Validation Score:** 4/5

References

  • Xie, T., Qi, X., Zeng, Y., Huang, Y., Sehwag, U. M., Huang, K., He, L., Wei, B., Li, D., Sheng, Y., Jia, R., Li, B., Li, K., Chen, D., Henderson, P., & Mittal, P. (2024). SORRY-Bench: Systematically evaluating large language model safety refusal behaviors. arXiv preprint arXiv:2406.14598. https://arxiv.org/abs/2406.14598
     
  • Cui, J., Chiang, W.-L., Stoica, I., & Hsieh, C.-J. (2024). OR-Bench: An over-refusal benchmark for large language models. arXiv preprint arXiv:2405.20947. https://arxiv.org/abs/2405.20947

About the metric

Modify this metric

Partnership on AI

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.