Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

19 citations of this metric

The Reject Rate is a metric used to evaluate the frequency at which a large language model (LLM) refuses to provide a response to a query. It is particularly relevant in scenarios where refusal is expected to mitigate risks associated with unsafe, biased, or non-ethical outputs. This metric helps measure the robustness and safety of an LLM by analyzing its ability to appropriately reject or decline unsafe requests.
 

Applicable Models:
Reject Rate is commonly applied to large language models, particularly those designed for conversational AI or content generation tasks.
 

Background:
With the increasing deployment of LLMs across diverse sectors, ensuring they do not generate harmful or inappropriate content has become a priority. Reject Rate evaluates refusal mechanisms in response to potentially unsafe, adversarial, or ambiguous inputs. Research such as SORRY-Bench and OR-Bench (e.g., Xie et al., 2024; Cui et al., 2024) underscores the importance of balancing refusal behavior with performance.
 

Formulae:
The Reject Rate is computed as:

Reject Rate = (Number of Rejected Queries) / (Total Number of Queries)

Where:

  • Rejected Queries: Instances where the model explicitly refuses to provide an output.
  • Total Queries: All inputs provided to the model during evaluation.
     

Applications:

  • Robustness Assessment: Evaluating how well a model handles adversarial or ambiguous inputs.
  • Safety Checks: Ensuring the model avoids generating harmful or biased content.
  • System Tuning: Fine-tuning refusal mechanisms in AI systems for compliance with ethical and regulatory guidelines.
     

Impact:
The Reject Rate is critical for maintaining the safety and reliability of AI systems. A well-calibrated reject rate ensures models refuse unsafe queries while minimizing the over-refusal of benign ones, which could degrade usability and user experience. It also provides transparency to developers and users about the system's safety mechanisms.

References

  • Xie, T., Qi, X., Zeng, Y., Huang, Y., Sehwag, U. M., Huang, K., He, L., Wei, B., Li, D., Sheng, Y., Jia, R., Li, B., Li, K., Chen, D., Henderson, P., & Mittal, P. (2024). SORRY-Bench: Systematically evaluating large language model safety refusal behaviors. arXiv preprint arXiv:2406.14598. https://arxiv.org/abs/2406.14598
     
  • Cui, J., Chiang, W.-L., Stoica, I., & Hsieh, C.-J. (2024). OR-Bench: An over-refusal benchmark for large language models. arXiv preprint arXiv:2405.20947. https://arxiv.org/abs/2405.20947

About the metric


Metric type(s):









Github stars:

  • 36

Modify this metric

catalogue Logos

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.