KAIST Researchers Expose Critical Security Flaw in MoE-Based AI Models


The information displayed in the AIM should not be reported as representing the official views of the OECD or of its member countries.

KAIST researchers have revealed a major security vulnerability in the Mixture-of-Experts (MoE) architectures used by large language models such as Google Gemini. Their study shows that introducing a single malicious expert model can raise the rate of harmful AI responses from 0% to 80%, severely compromising model safety without any noticeable loss in performance.[AI generated]
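For intuition on why a single expert matters, the minimal NumPy sketch below shows a toy top-1 MoE routing step in which one expert's weights have been overwritten. It is purely illustrative: the dimensions, routing rule, and tampering shown here are assumptions made for the example, not the KAIST researchers' method or Gemini's actual architecture.

```python
import numpy as np

# Toy illustration (not the KAIST attack): an MoE layer routes each input to
# one of several expert networks. If one expert has been tampered with, every
# input routed to it passes through the compromised weights, while the rest
# of the model is untouched.

rng = np.random.default_rng(0)
dim, n_experts = 8, 4

router_w = rng.normal(size=(dim, n_experts))                        # routing weights
experts = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]   # benign experts

# Hypothetical tampering: overwrite one expert's weights entirely.
experts[2] = np.full((dim, dim), 10.0)

def moe_forward(x):
    """Top-1 routing: the highest-scoring expert alone produces the output."""
    scores = x @ router_w
    chosen = int(np.argmax(scores))
    return experts[chosen] @ x, chosen

for _ in range(5):
    x = rng.normal(size=dim)
    out, chosen = moe_forward(x)
    flag = "  <- compromised expert" if chosen == 2 else ""
    print(f"routed to expert {chosen}, output norm {np.linalg.norm(out):6.1f}{flag}")
```

Inputs that happen to be routed to the overwritten expert come back visibly distorted, which is the structural point the study highlights: MoE efficiency rests on trusting each expert with its slice of the computation.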

Why's our monitor labelling this an incident or hazard?

The event explicitly involves AI systems, namely commercial LLMs using a mixture of experts architecture. The research demonstrates a method by which malicious manipulation of one expert model can lead to the AI system generating harmful or unsafe responses, which constitutes a direct threat to the safety and reliability of the AI system. Although no actual harm to people or property is reported as having occurred yet, the described vulnerability clearly poses a credible risk of harm if exploited. Therefore, this event qualifies as an AI Hazard because it describes a plausible future harm stemming from the development and use of AI systems, specifically a security vulnerability that could lead to unsafe AI behavior.[AI generated]
AI principles
Robustness & digital security, Safety, Accountability, Transparency & explainability

Industries
Digital security

Affected stakeholders
General public, Business

Harm types
Reputational, Psychological

Severity
AI hazard

AI system task
Content generation


Articles about this incident or hazard


New Vulnerability Found in the Core Architecture of Commercial LLMs Such as Google Gemini

2025-12-26
Chosunbiz
Why's our monitor labelling this an incident or hazard?
The event explicitly involves AI systems, namely commercial LLMs using a mixture of experts architecture. The research demonstrates a method by which malicious manipulation of one expert model can lead to the AI system generating harmful or unsafe responses, which constitutes a direct threat to the safety and reliability of the AI system. Although no actual harm to people or property is reported as having occurred yet, the described vulnerability clearly poses a credible risk of harm if exploited. Therefore, this event qualifies as an AI Hazard because it describes a plausible future harm stemming from the development and use of AI systems, specifically a security vulnerability that could lead to unsafe AI behavior.

KAIST Reveals the 'ChatGPT Architecture' Vulnerable to Hacking

2025-12-26
Chosunbiz
Why's our monitor labelling this an incident or hazard?
The event involves an AI system (large language models with MoE architecture) and its development and use. The research identifies a structural vulnerability that could be exploited by hackers to insert malicious expert AIs, leading to harmful AI outputs. This constitutes a plausible risk of harm (e.g., harmful or unsafe AI responses) that could affect users or communities. Since the article focuses on the potential for harm due to hacking and misuse of AI systems rather than describing an actual realized harm incident, it fits the definition of an AI Hazard. The article does not describe a specific AI Incident where harm has already occurred, nor is it merely complementary information or unrelated news.

Don't Trust AI Too Much... Safety 'Crumbles' If Even One Part Goes Wrong - 매일경제

2025-12-26
mk.co.kr
Why's our monitor labelling this an incident or hazard?
The event involves the use and development of AI systems (LLMs composed of multiple expert AI models). The research identifies a security vulnerability that could plausibly lead to AI incidents involving harm to communities or safety if malicious expert models are introduced. Since the article describes a newly discovered attack method that could undermine AI safety and cause harmful outputs, but does not report actual harm occurring yet, this qualifies as an AI Hazard. The article focuses on the potential for harm and the need for safety verification, not on an incident where harm has already happened.

KAIST "구글 Gemini 등 LLM 구조 악용한 보안 위협 규명"

2025-12-25
아시아경제
Why's our monitor labelling this an incident or hazard?
The event involves the use and potential misuse of AI systems (LLMs with MoE architecture). The research reveals a security vulnerability that could directly lead to harm by causing AI systems to generate dangerous or harmful outputs. Although no actual harm is reported as having occurred yet, the demonstrated attack method shows a credible and plausible risk of harm stemming from the AI system's structure and use. Therefore, this qualifies as an AI Hazard because it identifies a plausible future harm due to the AI system's vulnerability, but no incident of realized harm is described in the article.

A Spy in the Latest AI? The 'Malicious Expert' Hiding in LLMs and Threatening Security | 한국일보

2025-12-26
한국일보
Why's our monitor labelling this an incident or hazard?
The event involves the use and potential misuse of AI systems (LLMs with MoE architecture). The research identifies a security vulnerability that could plausibly lead to AI incidents involving harm through unsafe or malicious AI outputs. Although no actual harm is reported as having occurred yet, the demonstrated attack method and its potential impact constitute a credible risk of future harm. Therefore, this qualifies as an AI Hazard rather than an AI Incident or Complementary Information, as the article focuses on the plausible threat rather than a realized harm or a response to a past incident.

"챗GPT·제미나이 답변, '악성 전문가AI'가 조작 가능"... 국내 연구진이 밝혔다

2025-12-26
문화일보
Why's our monitor labelling this an incident or hazard?
The event involves AI systems (LLMs with MoE architecture) and their potential misuse or malfunction due to malicious expert models. Although the article describes a scenario where harmful outputs could be generated, it is presented as a research finding demonstrating a plausible security vulnerability rather than a real-world incident causing harm. No actual harm or violation has been reported as having occurred yet. Therefore, this qualifies as an AI Hazard because it plausibly could lead to an AI Incident if exploited, but no incident has materialized at this time.

KAIST Identifies 'Malicious Expert AI' Threat Exploiting Google Gemini's Architecture

2025-12-26
디지털데일리
Why's our monitor labelling this an incident or hazard?
The event involves the use and potential misuse of AI systems, specifically large language models with MoE architecture. The research demonstrates that the malicious manipulation of one expert model can directly lead to harmful AI outputs, which constitutes harm to users and communities relying on these AI systems. Since the article describes a concrete attack method that increases harmful responses from 0% to 80%, this is a realized harm scenario rather than a mere potential risk. Therefore, this qualifies as an AI Incident because the AI system's use and manipulation have directly led to significant harm in terms of unsafe AI behavior.

KAIST Identifies 'Malicious Expert AI' Security Threat Exploiting Google Gemini's Architecture

2025-12-26
이뉴스투데이
Why's our monitor labelling this an incident or hazard?
The event involves the identification of a security vulnerability in AI systems (large language models using MoE architecture) that can be exploited to cause the AI to produce harmful outputs. This constitutes a direct harm to the safety and reliability of AI systems, which can lead to harm to users or communities relying on these AI outputs. Although the article does not describe an actual incident of harm occurring yet, the demonstrated attack method shows a clear and significant risk of harm if exploited. Therefore, this qualifies as an AI Hazard because it plausibly could lead to an AI Incident involving harm to people or communities through unsafe AI behavior. The article focuses on the research discovery and its implications rather than reporting a realized harm, so it is not an AI Incident. It is more than complementary information because it reveals a new credible threat rather than just updates or responses to existing issues.

KAIST Finds Vulnerability in Google's 'Gemini' Model... "The MoE Architecture Is a Security Hole" - 이비엔(EBN)뉴스센터

2025-12-26
이비엔(EBN)뉴스센터
Why's our monitor labelling this an incident or hazard?
The event involves the discovery of a security vulnerability in an AI system architecture (MoE) that can be exploited to cause the AI system to generate harmful outputs. This is a direct harm to the safety and reliability of AI systems, which can be considered harm to users or communities relying on the AI outputs. The research identifies a concrete attack method that increases harmful responses from 0% to 80%, indicating realized or imminent harm potential. Therefore, this qualifies as an AI Incident because the AI system's development and use have directly led to a significant harm related to AI safety and security.

[헬로티 HelloT] KAIST Identifies Structural Security Risks in Mixture-of-Experts AI Such as Gemini

2025-12-26
hellot.net
Why's our monitor labelling this an incident or hazard?
The article explicitly involves AI systems, specifically large language models using the MoE architecture. The research reveals a security weakness that can be exploited to cause harmful outputs, which constitutes a direct or indirect harm to users or communities relying on these AI systems. Although the harm is demonstrated experimentally and not necessarily reported as having caused real-world damage yet, the demonstrated attack method shows a clear and significant risk of harm. Therefore, this event qualifies as an AI Incident because the AI system's use and structural vulnerability have directly led to a significant security risk with potential harmful consequences.

Just One 'Thing' Pushes the Harmful-Response Rate of Gemini, ChatGPT, and Others up to 80% - 월요신문

2025-12-26
월요신문
Why's our monitor labelling this an incident or hazard?
The event involves the use and potential misuse of AI systems (large language models with MoE architecture) leading to a significant increase in harmful outputs, which constitutes harm to users and communities through unsafe AI responses. The research reveals a concrete attack method that compromises AI safety, thus directly linking AI system use and malfunction to harm. Therefore, this qualifies as an AI Incident because the AI system's malfunction or exploitation has directly led to increased harmful outputs, a form of harm to communities and users.

"LLM에 악성 AI 하나만 섞여도 안전성 무너진다" - 서울파이낸스

2025-12-26
서울파이낸스
Why's our monitor labelling this an incident or hazard?
The event involves the development and use of AI systems (LLMs with MoE architecture) and identifies a novel attack method that could seriously compromise AI safety by causing harmful outputs. The research demonstrates that a single malicious expert model can degrade the safety of the entire system, which could plausibly lead to harms such as generating harmful or unsafe content. Since no actual harm has yet occurred but the risk is credible and significant, this qualifies as an AI Hazard under the definitions provided. The article focuses on the potential security threat and does not report a realized harm or incident.

Planting a 'Mole' in AI... New Security Threat to Large Language Models Identified

2025-12-26
dongascience.com
Why's our monitor labelling this an incident or hazard?
The event involves the use and potential misuse of an AI system (LLM with MoE architecture) leading to a direct risk of harm through generation of harmful AI outputs. Although no actual harm is reported as having occurred yet, the demonstrated attack technique plausibly leads to an AI Incident by significantly increasing harmful responses, which can cause harm to individuals or communities. Therefore, this event qualifies as an AI Hazard because it plausibly leads to harm, but since no actual harm has been reported as realized, it is not yet an AI Incident. The article focuses on the identification of a new security threat and its implications, not on a realized incident or harm.

'Exploiting the MoE Architecture Raises Gemini's Harmful Responses by up to 80%'... KAIST Is First in the World to Identify the LLM Security Vulnerability

2025-12-26
쿠키뉴스
Why's our monitor labelling this an incident or hazard?
The article explicitly involves AI systems, specifically large language models with MoE architecture. It details a security vulnerability discovered through research, showing how malicious manipulation could lead to harmful AI outputs. No actual harm or incident is reported as having occurred yet; the focus is on the potential for such harm if the vulnerability is exploited. This aligns with the definition of an AI Hazard, where the AI system's development or use could plausibly lead to harm. The event is not merely complementary information because it introduces a new, significant potential risk rather than updates or responses to past incidents. It is not unrelated because it directly concerns AI system vulnerabilities with safety implications.

[팩플] Trying to Boost AI Efficiency Ends Up Threatening Safety... KAIST Identifies a New Security Threat | 중앙일보

2025-12-26
중앙일보
Why's our monitor labelling this an incident or hazard?
The event involves the use and potential misuse of AI systems (LLMs with MoE architecture). The research experimentally demonstrates that a malicious expert AI can cause the AI system to generate harmful outputs, which constitutes a direct threat to AI safety and could lead to harm to users or communities relying on such AI. However, the article describes a research discovery and experimental demonstration rather than an actual incident causing harm. Therefore, it is an AI Hazard because it plausibly leads to AI incidents by exposing a new attack vector that could be exploited to cause harm in the future.

KAIST Study Warns of Security Threats to Mixture-of-Experts (MoE) LLMs: AI Safety Shaken

2025-12-26
Head Topics
Why's our monitor labelling this an incident or hazard?
The event involves an AI system (LLMs using MoE architecture) and describes a security vulnerability in its development and use that leads to a significant increase in harmful AI outputs (harm to communities/users). The malicious insertion of an expert AI model causes the AI system to produce unsafe responses, which is a direct harm linked to the AI system's malfunction or misuse. Therefore, this qualifies as an AI Incident because the AI system's use and vulnerability have directly led to harm (increased harmful responses).