Catalogue of Tools & Metrics for Trustworthy AI

These tools and metrics are designed to help AI actors develop and use trustworthy AI systems and applications that respect human rights and are fair, transparent, explainable, robust, secure and safe.

The Attack Success Rate (ASR) measures the effectiveness of adversarial attacks against machine learning models. It is calculated as the percentage of attacks that successfully cause a model to misclassify or generate incorrect outputs. This metric is pivotal in evaluating the robustness of models to adversarial perturbations.

 

Applicable Models:

 

ASR is relevant to multiple types of models, including:

Image Classification Models: Convolutional Neural Networks (CNNs) and their extensions.

Natural Language Processing Models: Transformers (e.g., BERT, GPT, T5) and RNNs.

Generative Models: Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).

Large Language Models (LLMs): Models such as GPT-4, PaLM, or LLama, which perform tasks like text generation, summarization, and classification.

 

Background:

 

The study of ASR originated in adversarial machine learning research, focusing on perturbations that deceive models into errors. For LLMs, adversarial attacks often involve:

1. Prompt Injection Attacks: Modifying user prompts to elicit unintended responses.

2. Toxicity or Bias Exploits: Triggering biased or harmful outputs by carefully crafting inputs.

3. Evasion Attacks: Causing misclassification in downstream tasks like spam detection or sentiment analysis.

 

These studies emphasize the metric’s value in evaluating vulnerabilities specific to text-based models, where even subtle changes in language can lead to incorrect outputs.

 

Formula:

 

ASR is calculated using the following formula:

ASR = (Number of Successful Adversarial Attacks / Total Number of Adversarial Attempts) × 100%

 

Applications:

 

ASR has critical applications in various fields, including:

1. Image and Text Classification: Evaluating how adversarial inputs affect tasks like object detection or sentiment analysis.

2. Text Generation in LLMs: Measuring the success of adversarially altered prompts in producing harmful, biased, or unintended outputs.

3. Comparing Attack Techniques: Assessing the performance of attack algorithms such as FGSM, PGD, or prompt-specific attacks on LLMs.

4. Enhancing Model Defense: Informing techniques like adversarial training, gradient masking, or reinforcement learning to strengthen model robustness.

 

Impact:

 

For LLMs, a high ASR indicates a vulnerability to adversarial manipulation that could compromise the system’s reliability, especially in sensitive applications like healthcare, law, or finance. Monitoring and minimizing ASR ensures the safe and trustworthy deployment of AI systems.

catalogue Logos

Disclaimer: The tools and metrics featured herein are solely those of the originating authors and are not vetted or endorsed by the OECD or its member countries. The Organisation cannot be held responsible for possible issues resulting from the posting of links to third parties' tools and metrics on this catalogue. More on the methodology can be found at https://oecd.ai/catalogue/faq.