Study Finds AI Chatbots Frequently Give Inaccurate Medical Advice

The information displayed in the AIM should not be reported as representing the official views of the OECD or of its member countries.

A Penn State study found that popular AI chatbots answer everyday health questions with about 76% accuracy, meaning roughly one in four responses is incorrect. Physicians warn that these errors could pose health risks if users rely on AI-generated advice instead of consulting medical professionals. [AI generated]
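
The "one in four" figure follows from the reported accuracy. A minimal sketch of that arithmetic, assuming the 76% figure is an overall proportion of correct responses (the function name is illustrative and does not come from the study):

```python
from fractions import Fraction


def error_rate(accuracy_percent: int) -> Fraction:
    """Return the share of incorrect responses implied by an accuracy percentage."""
    return Fraction(100 - accuracy_percent, 100)


rate = error_rate(76)   # 24/100, which reduces to 6/25
one_in_n = 1 / rate     # 25/6, a little over 4
print(f"error rate: {float(rate):.0%}, about 1 in {round(one_in_n)}")
```

The exact ratio is 1 in 25/6 (about 4.17), which is why the summary rounds it to "roughly one in four".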

Why's our monitor labelling this an incident or hazard?

The event involves AI systems (LLMs such as ChatGPT) used to answer health-related questions, which clearly constitutes AI system involvement. The study assesses the systems' accuracy and potential for harm, directly addressing the risk of injury or harm to health from AI-generated medical advice. However, the article reports no actual harm, only the potential for harm implied by the error rates. This therefore qualifies as an AI Hazard: the AI systems' use could plausibly lead to harm in real-world client-facing applications, but no direct harm has been documented in this report. [AI generated]

AI principles
Safety

Industries
Healthcare, drugs, and biotechnology

Affected stakeholders
Consumers

Harm types
Physical (injury)

Severity
AI hazard

AI system task
Interaction support/chatbots


Articles about this incident or hazard

AI chatbots answer health questions with moderate overall accuracy

2026-05-29
News-Medical.net
Why's our monitor labelling this an incident or hazard?
The event involves AI systems (LLMs such as ChatGPT) used to answer health-related questions, which clearly constitutes AI system involvement. The study assesses the systems' accuracy and potential for harm, directly addressing the risk of injury or harm to health from AI-generated medical advice. However, the article reports no actual harm, only the potential for harm implied by the error rates. This therefore qualifies as an AI Hazard: the AI systems' use could plausibly lead to harm in real-world client-facing applications, but no direct harm has been documented in this report.

Calling Doctor GPT: AI responses to healthcare queries are nearly 76% accurate

2026-05-28
EurekAlert!
Why's our monitor labelling this an incident or hazard?
The article explicitly involves AI systems (LLMs like ChatGPT, Gemini, Llama) used for healthcare queries, which fits the definition of AI systems. The study evaluates the AI's accuracy and potential for harm, noting that errors could be harmful but does not report any actual harm occurring. The event is about research findings and understanding AI's role and risks in healthcare, which aligns with Complementary Information as it provides supporting data and context about AI impacts and responses without describing a specific AI Incident or Hazard. There is no indication of direct or indirect harm having occurred, nor a plausible imminent hazard beyond the general risk discussed. Hence, the classification is Complementary Information.

Calling Doctor GPT: AI Responses To Healthcare Queries Are Nearly 76% Accurate

2026-05-29
Eurasia Review
Why's our monitor labelling this an incident or hazard?
The article explicitly involves AI systems (LLMs) used for healthcare queries and evaluates their accuracy and potential for harm. While the study finds a significant error rate that could lead to harmful outcomes, no actual harm or injury is reported. The focus is on understanding the risks and informing safer use, especially emphasizing that AI is better suited as a tool for trained physicians rather than patients. This fits the definition of an AI Hazard, where the AI system's use could plausibly lead to harm but has not yet directly or indirectly caused harm. It is not an AI Incident because no realized harm is described, nor is it Complementary Information or Unrelated since the AI system and its risks are central to the event.

AI Doctor: GPT's Healthcare Answers 76% Accurate

2026-05-28
Mirage News
Why's our monitor labelling this an incident or hazard?
The article explicitly involves AI systems (LLMs like ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro, and Llama3-8b) used for medical question answering. The study evaluates the accuracy and potential harm of these AI outputs but does not report any actual injury, rights violation, or other harm caused by the AI responses. The concerns raised are about possible future harm if such AI tools are used by patients without medical training. Hence, the event fits the definition of an AI Hazard, as it plausibly could lead to harm but no harm has yet occurred or been documented.

Doctor GPT: AI Achieves Nearly 76% Accuracy in Answering Healthcare Queries

2026-05-28
Scienmag: Latest Science and Health News
Why's our monitor labelling this an incident or hazard?
The article explicitly involves AI systems (large language models) used to answer health queries, with a detailed assessment of their clinical accuracy and potential harm. The study documents misinformation and errors that could harm patients if AI advice is used uncritically, linking AI use to potential harm to health (harm category a). Although the study is research-focused and does not report a specific incident of harm, it clearly identifies the AI systems' error rates and the plausible risk of harm from their use by laypersons without clinical oversight. This constitutes an AI Hazard because the AI's use could plausibly lead to harm, but no actual harm event is reported. The article also discusses mitigation strategies and the importance of human oversight, but these are part of the study's findings rather than a response to an incident. The event is therefore best classified as an AI Hazard, reflecting credible potential for harm from AI use in healthcare contexts.

Even the Best AI Chatbot Gets Health Questions Wrong 1 in 5 Times, Doctors Find

2026-05-29
StudyFinds
Why's our monitor labelling this an incident or hazard?
The event involves AI systems (large language models) used in medical advice, whose outputs have a significant error rate that could directly or indirectly lead to harm to individuals' health if acted upon. The article describes realized performance shortcomings and expert concerns about potential harm, but does not report a specific incident of actual harm occurring. Therefore, it fits the definition of an AI Hazard, as the AI systems' use could plausibly lead to injury or harm to health, but no concrete harm event is documented yet. The article is not merely general AI news or a product announcement, nor is it a response or update to a prior incident, so it is not Complementary Information. Hence, the classification is AI Hazard.

AI Chatbots Do Not Consistently Deliver Accurate Health Responses

2026-05-29
Inside Precision Medicine
Why's our monitor labelling this an incident or hazard?
The event involves AI systems (large language models) used for health advice, which can directly or indirectly lead to harm to individuals' health if inaccurate information is relied upon. Although no specific harm event is reported, the study identifies a credible risk of harm from AI chatbot use in medical contexts. Therefore, this qualifies as an AI Hazard because it plausibly could lead to an AI Incident (harm to health) if users rely on inaccurate AI-generated medical advice. The article is not about a realized harm incident, nor is it merely complementary information or unrelated news. It focuses on the plausible future harm from AI chatbot use in healthcare.
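
The rationales above apply a recurring decision rule: an AI system must be involved, realized harm points to an AI Incident, plausible but unrealized harm points to an AI Hazard, and supporting context with neither is treated as Complementary Information. A rough sketch of that rule as it reads in these rationales follows. The class, field names, and ordering of checks are assumptions made for illustration, not the monitor's actual implementation:

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical fields inferred from the wording of the rationales.
    ai_system_involved: bool
    harm_realized: bool
    harm_plausible: bool


def classify(event: Event) -> str:
    """Map an event to one of the four labels named in the rationales."""
    if not event.ai_system_involved:
        return "Unrelated"
    if event.harm_realized:
        return "AI Incident"
    if event.harm_plausible:
        return "AI Hazard"
    return "Complementary Information"


# The chatbot accuracy study: AI involved, no reported harm, plausible harm.
print(classify(Event(ai_system_involved=True, harm_realized=False, harm_plausible=True)))
```

Under this sketch, the one rationale that lands on Complementary Information corresponds to judging `harm_plausible` as false, since it finds no plausible imminent hazard beyond the general risk discussed.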