NTU researchers develop 'Masterkey' AI to jailbreak ChatGPT and other chatbots

The information displayed in the AIM should not be reported as representing the official views of the OECD or of its member countries.

Researchers at Singapore's Nanyang Technological University created 'Masterkey', a large language model technique that pits chatbots like ChatGPT, Google Bard and Bing Chat against each other to reverse-engineer and bypass their safety filters, enabling them to generate content blocked by their developers. [AI generated]

Why's our monitor labelling this an incident or hazard?

The article explicitly involves AI systems (large language model chatbots) and their development and use. The researchers' method directly targets the AI systems' safeguards to enable harmful outputs, which could plausibly lead to harms such as dissemination of malicious content or misinformation. Although no actual harm is reported as having occurred yet, the described capability represents a credible risk of harm resulting from AI misuse. Therefore, this event qualifies as an AI Hazard because it plausibly could lead to an AI Incident involving harm to communities or other significant harms. [AI generated]

AI principles

Safety; Robustness & digital security; Transparency & explainability; Accountability; Respect of human rights; Fairness

Industries

Media, social platforms, and marketing; Digital security; IT infrastructure and hosting; Consumer services

Affected stakeholders

General public; Consumers

Harm types

Reputational; Public interest; Psychological; Economic/Property; Human or fundamental rights

Severity

AI hazard

Business function:

Research and development

AI system task:

Content generation; Interaction support/chatbots

Articles about this incident or hazard

Researchers just unlocked ChatGPT | Digital Trends

2024-01-04

Digital Trends

Why's our monitor labelling this an incident or hazard?

The event involves the use and development of AI systems (large language models/chatbots) and their security mechanisms. The researchers' method directly undermines the AI systems' safety features, enabling outputs that are normally blocked due to violent, immoral, or malicious content. This creates a credible risk of harm, such as dissemination of harmful or dangerous content, misinformation, or enabling malicious use of AI. Since the jailbreak method has been demonstrated and is effective, and the article discusses the potential for bad actors to exploit this, the event constitutes an AI Incident due to the realized vulnerability and direct link to potential harms from misuse of AI outputs.

Chatbot vs chatbot - researchers train AI chatbots to hack each other, and they can even do it automatically

2024-01-02

TechRadar

Why's our monitor labelling this an incident or hazard?

The article explicitly involves AI systems (large language model chatbots) and their development and use. The researchers' method directly targets the AI systems' safeguards to enable harmful outputs, which could plausibly lead to harms such as dissemination of malicious content or misinformation. Although no actual harm is reported as having occurred yet, the described capability represents a credible risk of harm resulting from AI misuse. Therefore, this event qualifies as an AI Hazard because it plausibly could lead to an AI Incident involving harm to communities or other significant harms.

AI chatbots trained to jailbreak other chatbots, as the AI war slowly but surely begins

2024-01-02

PC Gamer

Why's our monitor labelling this an incident or hazard?

The article explicitly involves AI systems (large language models/chatbots) and their use in a novel method to compromise other AI systems' safety mechanisms. Although no direct harm is reported as having occurred yet, the capability to generate unethical content by jailbreaking chatbots plausibly leads to harms such as abusive or violent content dissemination, which can harm communities and violate ethical standards. The researchers' reporting of the issue and the ongoing challenge in preventing such attacks further support the classification as an AI Hazard rather than an Incident. The event is not merely general AI news or a complementary update but a credible warning of potential harm from AI misuse.

This AI Chatbot is Trained to Jailbreak Other Chatbots

2024-01-03

VICE

Why's our monitor labelling this an incident or hazard?

The article explicitly involves AI systems: large language models (ChatGPT, Bard, Bing Chat) and an AI tool (Masterkey) trained to generate jailbreak prompts. The use of Masterkey directly leads to the generation of harmful content by the chatbots, including illegal and abusive material, which is a clear harm to communities and a violation of content policies and potentially rights. The researchers' demonstration of successful jailbreaks with a significant success rate confirms realized harm rather than just potential. Although the researchers aim to help improve defenses, the event describes actual misuse and harm caused by AI systems. Hence, this is an AI Incident rather than a hazard or complementary information.

Researchers put AI chatbots against themselves to "jailbreak" each other

2024-01-02

Notebookcheck

Why's our monitor labelling this an incident or hazard?

The article explicitly involves AI systems (large language models/chatbots) and their use in a novel way to bypass restrictions, which is a manipulation of AI system behavior. While this could plausibly lead to harmful outputs or misuse (e.g., generating harmful or disallowed content), the article does not describe any realized harm or incident resulting from this method. Therefore, it fits the definition of an AI Hazard, as it plausibly could lead to an AI Incident in the future but no harm has yet materialized.

Researchers create AI that can 'jailbreak' other chatbots

2024-01-02

ReadWrite

Why's our monitor labelling this an incident or hazard?

The article explicitly describes an AI system developed to 'jailbreak' other chatbots and access restricted content, which is a clear AI system involvement. Although the researchers' intent is to highlight risks and improve safety, the AI system's capability to bypass protections could plausibly lead to harms such as enabling malicious actors to obtain dangerous information. No actual harm is reported as having occurred yet, so this is a plausible future harm scenario. Hence, the event fits the definition of an AI Hazard rather than an AI Incident or Complementary Information.

This New Chatbot Can Jailbreak Other Chatbots - Wonderful Engineering

2024-01-01

Wonderful Engineering

Why's our monitor labelling this an incident or hazard?

Masterkey is an AI system explicitly designed to exploit vulnerabilities in other AI systems (LLMs) to generate forbidden or dangerous content, which constitutes harm to communities and potentially violates content safety norms. The generation of harmful content by compromised chatbots is a direct consequence of Masterkey's use, fulfilling the criteria for an AI Incident. Although the researchers' intent is to improve AI security, the event describes actual realized harm through the generation of forbidden content, not just a potential risk. Therefore, this event qualifies as an AI Incident rather than a hazard or complementary information.

AI Chatbots Vulnerable to "Masterkey" Attack, Researchers Warn | Cryptopolitan

2024-01-04

Cryptopolitan

Why's our monitor labelling this an incident or hazard?

The event involves the use and development of AI systems (large language models/chatbots) and a novel method to circumvent their safety restrictions. Although the article does not report any realized harm, the Masterkey process could plausibly lead to AI incidents by enabling chatbots to generate harmful, violent, or malicious content that was previously restricted. This represents a credible security hazard with potential for significant harm to users or communities if exploited maliciously. Therefore, this event fits the definition of an AI Hazard, as it describes a plausible future risk stemming from AI system vulnerabilities.

AI Jailbreaks: 'Masterkey' Model Bypasses ChatGPT Safeguards

2024-01-02

ITPro Today

Why's our monitor labelling this an incident or hazard?

The article explicitly involves AI systems: the Masterkey LLM developed to generate jailbreak prompts for other LLM-based chatbots like ChatGPT, Google Bard, and Bing Chat. The Masterkey model's purpose is to circumvent safeguards to produce content that developers intended to restrict, including violent or unethical outputs. This use of AI to bypass safety mechanisms creates a credible risk of harm (e.g., generation of harmful or criminal content), even though the article does not document actual incidents of harm occurring. The researchers themselves describe this as a "clear and present threat." Since the harm is plausible but not yet realized or documented, the event fits the definition of an AI Hazard rather than an AI Incident. The article also mentions that the researchers reported the vulnerabilities to AI developers, indicating ongoing mitigation efforts but no resolved incident. Thus, the classification is AI Hazard.

A group of scientists jailbroke ChatGPT - Digital Trends Español

2024-01-04

Digital Trends Español

Why's our monitor labelling this an incident or hazard?

The article explicitly involves AI systems (LLM chatbots) and their development and use. The researchers' method directly undermines the safety mechanisms designed to prevent harmful or prohibited content generation, which could plausibly lead to AI incidents such as dissemination of harmful, violent, or malicious content. Although no actual harm is reported as having occurred yet, the demonstrated ability to bypass safeguards represents a credible risk of future harm. Therefore, this event qualifies as an AI Hazard because it plausibly could lead to an AI Incident if exploited maliciously or unintentionally.

Researchers hack the most popular AI chatbots with a method that updates itself automatically

2024-01-04

Mundo Deportivo

Why's our monitor labelling this an incident or hazard?

The article explicitly involves AI systems (large language model chatbots) and describes a new AI system developed by researchers that can hack these chatbots to bypass their safety restrictions. The method enables the generation of restricted content, which could include harmful or illegal outputs. While the article does not report actual incidents of harm occurring, the described capability clearly poses a credible risk of future harm, such as spreading harmful or criminal content. This fits the definition of an AI Hazard, as the AI system's development and use could plausibly lead to an AI Incident. There is no indication that harm has already occurred, so it is not an AI Incident. The article is not merely complementary information or unrelated news, as it focuses on a new AI system with potential for harm.

New crisis in AI: scientists use a 'jailbreak' that allows all the rules to be bypassed

2024-01-05

Vandal

Why's our monitor labelling this an incident or hazard?

The event involves the development and use of an AI system (chatbots based on large language models) and a method (jailbreak) that enables these AI systems to bypass ethical and legal safeguards. This directly relates to the AI system's development and use. While the article does not report actual harm occurring, the described capability plausibly leads to AI incidents such as violations of ethical norms, potential illegal content generation, and harm to communities. Therefore, this constitutes an AI Hazard because it plausibly leads to significant harms through misuse of AI systems.

A group of 'hackers' sabotages ChatGPT and Bard so that they generate inappropriate content

2024-01-05

El Confidencial

Why's our monitor labelling this an incident or hazard?

The article explicitly involves AI systems (ChatGPT and Bard, large language models) and their manipulation through jailbreaking techniques to generate inappropriate content. Although no direct harm occurred, the experiment reveals vulnerabilities that could plausibly lead to AI incidents if exploited by malicious actors. Therefore, this event qualifies as an AI Hazard because it demonstrates a credible risk of future harm stemming from AI system misuse, but no actual harm has been reported yet.

They use their own 'chatbot' that 'jailbreaks' ChatGPT and...

2024-01-04

Europa Press

Why's our monitor labelling this an incident or hazard?

The article explicitly involves AI systems (LLM-based chatbots) and their development and use. The researchers' method enables malicious circumvention of safety controls, which could plausibly lead to AI incidents involving harm to communities through unethical or harmful content generation. Although no actual harm is reported yet, the credible risk and potential for misuse are clear. The event is not merely general AI news or a response update but a demonstration of a method that could lead to significant AI-related harm. Hence, it fits the definition of an AI Hazard rather than an AI Incident or Complementary Information.

They create a 'chatbot' that 'jailbreaks' ChatGPT to generate inappropriate content

2024-01-04

El Tiempo

Why's our monitor labelling this an incident or hazard?

The event involves AI systems explicitly (large language model chatbots) and their misuse through 'jailbreaking' techniques developed by researchers. The misuse could plausibly lead to harms such as generation of inappropriate or harmful content, which can impact communities and ethical standards. Since the article discusses the development of a method that could be used maliciously in the future but does not describe actual realized harm, it fits the definition of an AI Hazard. The article also emphasizes the potential for cybercriminals to exploit these vulnerabilities, reinforcing the plausible future harm. Therefore, the classification is AI Hazard.