Researchers Bypass AI Image Generator Safeguards to Create NSFW Content


The information displayed in the AIM should not be reported as representing the official views of the OECD or of its member countries.

Johns Hopkins University researchers demonstrated that popular AI image generators like DALL-E 2 and Stable Diffusion can be manipulated to produce NSFW and violent images by bypassing safety filters using adversarial prompts. This vulnerability allows anyone, including malicious users, to generate inappropriate and potentially harmful content.[AI generated]

Why's our monitor labelling this an incident or hazard?

The article explicitly involves AI systems (DALL-E 2 and Stable Diffusion) and their use in generating images. The researchers demonstrated that these AI systems can be manipulated to produce harmful content, which is a direct misuse or malfunction of the AI systems' safety mechanisms. The generation of NSFW and misleading images constitutes harm to communities and individuals by enabling the spread of inappropriate and potentially deceptive content. Since the harmful outputs have been produced and the systems' vulnerabilities exploited, this is a realized harm rather than a mere potential risk. The event thus meets the criteria for an AI Incident, as the AI systems' malfunction or misuse has directly led to harm.[AI generated]
AI principles
Robustness & digital security
Safety
Accountability
Transparency & explainability

Industries
Media, social platforms, and marketing
Arts, entertainment, and recreation
Consumer services

Affected stakeholders
General public

Harm types
Psychological
Reputational

Severity
AI incident

AI system task
Content generation


Articles about this incident or hazard


AI image generators can be tricked into making NSFW content

2023-11-01
The Hub

Researchers Say NSFW AI Images Can be Generated by Nonsense Prompts

2023-11-02
PetaPixel
Why's our monitor labelling this an incident or hazard?
The researchers exploited AI image generators (AI systems) to produce harmful NSFW and misleading images by bypassing content filters, demonstrating a direct link between AI system use and harm. The generation of fake images of public figures and violent scenes can cause reputational harm and spread misinformation, which are recognized harms under the framework. The event involves actual realized harm through the creation and potential dissemination of harmful content, not just a theoretical risk. Hence, it meets the criteria for an AI Incident rather than a hazard or complementary information.

"Nonsense" Prompts Trick AIs Into Producing NSFW Images

2023-11-03
Technology Networks
Why's our monitor labelling this an incident or hazard?
The article explicitly involves AI image generation systems manipulated to produce harmful content that safety filters are supposed to block. The misuse of these AI systems directly leads to the creation of inappropriate and potentially harmful images, which constitutes harm to communities and possibly individuals' reputations. The researchers demonstrate that the AI systems' safeguards are insufficient, and the misuse is feasible by anyone with the adversarial prompts. This direct link between AI system misuse and harmful outputs fits the definition of an AI Incident, as the harm is realized and not merely potential. The event is not just a warning or a future risk but a demonstrated exploitation causing harm, thus not an AI Hazard or Complementary Information.

Rated-R robot! People can trick AI into creating NSFW content

2023-11-02
Study Finds
Why's our monitor labelling this an incident or hazard?
The event explicitly involves AI systems (DALL-E 2 and Stable Diffusion) that are manipulated to produce harmful outputs (NSFW and violent images). This misuse directly leads to harm in the form of potential reputational damage, misinformation, and community harm. The research shows that the AI systems' safeguards are insufficient, enabling malicious or careless users to generate harmful content. Therefore, this qualifies as an AI Incident because the AI systems' use has directly led to harm (or at least the creation of harmful content).

AI image generators can be tricked into making NSFW content

2023-11-02
Tech Xplore
Why's our monitor labelling this an incident or hazard?
The AI systems (image generators) are explicitly involved and manipulated to produce harmful outputs. Although the article does not report actual incidents of harm, it clearly establishes that the AI systems' malfunction or misuse could plausibly lead to harms such as misinformation, reputational damage, or exposure to inappropriate content. This fits the definition of an AI Hazard, as the event shows credible potential for harm stemming from the AI systems' vulnerabilities and misuse. The article focuses on the demonstration of this risk and the potential for future harm rather than reporting a realized harm incident.

AI Image Generators Prone to Creating NSFW Content

2023-11-01
Mirage News
Why's our monitor labelling this an incident or hazard?
The article explicitly involves AI systems (DALL-E 2 and Stable Diffusion) and describes how their use has directly led to the creation of harmful content (NSFW and violent images) by bypassing safety filters. This misuse can cause harm to individuals and communities by spreading inappropriate or misleading images, fulfilling the criteria for harm to communities and violation of content safety. The event is not merely a potential risk but demonstrates actual exploitation and harm, qualifying it as an AI Incident rather than a hazard or complementary information.