Meta's AI Safety Model Vulnerability Exposed

Thumbnail Image

The information displayed in the AIM should not be reported as representing the official views of the OECD or of its member countries.

Meta's AI safety model, Prompt-Guard-86M, designed to prevent prompt injection and jailbreak attacks, is vulnerable to a simple trick involving spaces and punctuation omission. This flaw allows bypassing its security features, potentially leading to harmful outputs. The vulnerability was discovered by Aman Priyanshu from Robust Intelligence.[AI generated]

Why's our monitor labelling this an incident or hazard?

An AI system (Prompt-Guard-86M) is explicitly involved and its failure creates a clear pathway for harm (malicious prompt injection). No actual incident of harm is reported, but the vulnerability introduces a credible risk of future misuse, fitting the definition of an AI Hazard.[AI generated]
AI principles
Robustness & digital securitySafetyAccountabilityTransparency & explainabilityHuman wellbeingDemocracy & human autonomy

Industries
Media, social platforms, and marketingDigital securityIT infrastructure and hosting

Affected stakeholders
ConsumersGeneral public

Harm types
PsychologicalReputationalEconomic/PropertyPublic interest

Severity
AI hazard

Business function:
ICT management and information securityMonitoring and quality control

AI system task:
Event/anomaly detectionInteraction support/chatbots


Articles about this incident or hazard

Thumbnail Image

Meta's AI safety model vulnerable to simple space bar trick

2024-07-30
NewsBytes
Why's our monitor labelling this an incident or hazard?
An AI system (Prompt-Guard-86M) is explicitly involved and its failure creates a clear pathway for harm (malicious prompt injection). No actual incident of harm is reported, but the vulnerability introduces a credible risk of future misuse, fitting the definition of an AI Hazard.
Thumbnail Image

Meta's AI safety system defeated by the space bar

2024-07-29
TheRegister.com
Why's our monitor labelling this an incident or hazard?
The article explicitly involves an AI system (Prompt-Guard-86M) and its malfunction (failure to detect prompt injection attacks due to a simple bypass technique). Although no actual harm has been reported yet, the vulnerability could plausibly lead to AI incidents such as the generation of harmful, dangerous, or sensitive content by AI models that rely on Prompt-Guard for safety. Therefore, this event fits the definition of an AI Hazard, as it describes a credible risk of future harm stemming from the AI system's malfunction.
Thumbnail Image

Study Shows Meta AI Safety System Easily Compromised

2024-08-01
ChannelE2E
Why's our monitor labelling this an incident or hazard?
The article explicitly involves an AI system (Prompt-Guard-86M) designed for AI safety, which is compromised by a prompt injection attack. The compromised system's failure to detect malicious prompts could directly or indirectly lead to harms such as spreading misinformation, enabling malicious AI use, or other security breaches. Since the vulnerability has been demonstrated and the system's defenses are effectively bypassed, this constitutes an AI Incident due to the realized security failure and potential for harm.
Thumbnail Image

Meta Prompt Guard Is Vulnerable to Prompt Injection Attacks

2024-07-30
DataBreachToday
Why's our monitor labelling this an incident or hazard?
The event involves an AI system (Meta's Prompt Guard) designed to prevent prompt injection attacks, which is an AI security mechanism. The researchers' discovery shows a malfunction or weakness in the AI system's defenses, which could plausibly lead to an AI Incident if exploited maliciously. Since no actual harm has been reported yet, but the vulnerability could plausibly lead to harm, this qualifies as an AI Hazard. The article focuses on the vulnerability and the potential risk rather than an actual incident of harm, and it includes information about ongoing mitigation efforts, but the primary focus is the vulnerability and its implications.
Thumbnail Image

meta-llama/Prompt-Guard-86M · Hugging Face

2024-07-30
huggingface.co
Why's our monitor labelling this an incident or hazard?
The article focuses on the development and deployment of an AI system (Prompt Guard) aimed at detecting and preventing malicious prompt attacks on LLMs. There is no indication that the system has caused any harm or that an incident involving harm has occurred. Nor does it describe a credible risk of harm that could plausibly lead to an AI Incident. Instead, it provides contextual and technical information about the model, its training, and its intended use as a protective measure. Therefore, this event is best classified as Complementary Information, as it enhances understanding of AI ecosystem responses to prompt attack risks without reporting a new incident or hazard.
Thumbnail Image

Meta's PromptGuard model bypassed by simple jailbreak, researchers say

2024-07-31
SC Media
Why's our monitor labelling this an incident or hazard?
The article explicitly involves an AI system (PromptGuard) designed to protect LLMs from adversarial attacks. The researchers demonstrated a method to bypass this AI system's protections, which could plausibly lead to harmful outputs or data leaks from LLMs. Although no actual harm is reported, the exploit's effectiveness and the potential consequences meet the criteria for an AI Hazard. The event does not describe a realized harm (AI Incident), nor is it merely complementary information or unrelated news. Hence, the classification as AI Hazard is appropriate.