Reject Rate (RR)

Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

Overview Tools Metrics About the catalogue

19 citations of this metric

Github

Website

The Reject Rate is a metric used to evaluate the frequency at which a large language model (LLM) refuses to provide a response to a query. It is particularly relevant in scenarios where refusal is expected to mitigate risks associated with unsafe, biased, or non-ethical outputs. This metric helps measure the robustness and safety of an LLM by analyzing its ability to appropriately reject or decline unsafe requests.

Applicable Models:
Reject Rate is commonly applied to large language models, particularly those designed for conversational AI or content generation tasks.

Background:
With the increasing deployment of LLMs across diverse sectors, ensuring they do not generate harmful or inappropriate content has become a priority. Reject Rate evaluates refusal mechanisms in response to potentially unsafe, adversarial, or ambiguous inputs. Research such as SORRY-Bench and OR-Bench (e.g., Xie et al., 2024; Cui et al., 2024) underscores the importance of balancing refusal behavior with performance.

Formulae:
The Reject Rate is computed as:

Reject Rate = (Number of Rejected Queries) / (Total Number of Queries)

Where:

Rejected Queries: Instances where the model explicitly refuses to provide an output.
Total Queries: All inputs provided to the model during evaluation.

Applications:

Robustness Assessment: Evaluating how well a model handles adversarial or ambiguous inputs.
Safety Checks: Ensuring the model avoids generating harmful or biased content.
System Tuning: Fine-tuning refusal mechanisms in AI systems for compliance with ethical and regulatory guidelines.

Impact:
The Reject Rate is critical for maintaining the safety and reliability of AI systems. A well-calibrated reject rate ensures models refuse unsafe queries while minimizing the over-refusal of benign ones, which could degrade usability and user experience. It also provides transparency to developers and users about the system's safety mechanisms.

References

Xie, T., Qi, X., Zeng, Y., Huang, Y., Sehwag, U. M., Huang, K., He, L., Wei, B., Li, D., Sheng, Y., Jia, R., Li, B., Li, K., Chen, D., Henderson, P., & Mittal, P. (2024). SORRY-Bench: Systematically evaluating large language model safety refusal behaviors. arXiv preprint arXiv:2406.14598. https://arxiv.org/abs/2406.14598
Cui, J., Chiang, W.-L., Stoica, I., & Hsieh, C.-J. (2024). OR-Bench: An over-refusal benchmark for large language models. arXiv preprint arXiv:2405.20947. https://arxiv.org/abs/2405.20947

About the metric

You can click on the links to see the associated metrics

Metric type(s):

Technical

Objective(s):

Robustness
Safety

Purpose(s):

Interaction support/chatbots
Content generation

Target sector(s):

Public governance
Innovation
Health
Education

Lifecycle stage(s):

Operate & monitor
Verify & validate

Usage rights:

Open source/Permissive

Target users:

Data scientist
Developer
Researcher

Risk management stage(s):

Assess risks & impacts
Assess
Govern

Github stars:

Modify this metric

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.