
ChatGPT, Gemini and Grok Gave Problematic Medical Advice in Half of Responses


AI-processed from Bloomberg Tech; edited by Hamidun News

A new audit published in BMJ Open reaches an unsettling conclusion: popular AI chatbots are already part of everyday life, but on health matters they cannot be relied upon as an independent source of recommendations. The researchers found that roughly half of the responses to medical queries were problematic, ranging from incomplete information to advice that, followed without medical consultation, could push a person toward an ineffective or potentially dangerous decision. The team tested five public services (ChatGPT, Gemini, Meta AI, Grok, and DeepSeek) across five topics where myths and misinformation are particularly common: cancer, vaccines, stem cells, nutrition, and athletic performance.

In February 2025, each bot was asked 50 questions, for a total of 250 responses analyzed. Some questions were closed-ended, with one correct answer within scientific consensus; others were open-ended, requiring the system to provide its own explanation or list possible courses of action.

The results were harsh. Half of all responses were deemed problematic: 30% moderately and another 20% severely. In other words, the problems go beyond minor wording errors and include advice that could lead users toward ineffective treatment or cause harm if followed without a doctor. Models performed particularly poorly on open-ended questions: when they had to formulate their own recommendation rather than choose from given options, the share of the riskiest answers rose notably.

The services differed, though all of them showed weaknesses. According to the study, Grok gave the most problematic responses most often: 29 of its 50 responses, or 58%, were classified in the most severe category. Gemini, by contrast, had the lowest share of the most problematic responses and the most answers without apparent issues.
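For readers who want to check the arithmetic behind these figures, here is a minimal Python sketch. It assumes the reported percentages are exact rather than rounded, and the derived response counts are an illustration, not numbers stated by the study. The function and variable names are hypothetical.

```python
# Figures as reported in the article; names are illustrative only.
BOTS = 5
QUESTIONS_PER_BOT = 50
total_responses = BOTS * QUESTIONS_PER_BOT


def share_to_count(percent: int, n: int) -> int:
    """Convert a whole-number percentage of n into a count (integer math)."""
    return percent * n // 100


# Assumption: 30% and 20% are exact shares of all responses.
moderate = share_to_count(30, total_responses)
severe = share_to_count(20, total_responses)
problematic = moderate + severe

# Grok: 29 of its 50 responses, expressed as a percentage.
grok_percent = 29 * 100 // QUESTIONS_PER_BOT

print("total responses:", total_responses)
print("moderately problematic:", moderate)
print("severely problematic:", severe)
print("problematic overall:", problematic)
print("Grok share (%):", grok_percent)
```

Integer arithmetic is used throughout so the conversions do not depend on floating-point rounding.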

By topic, chatbots performed best on vaccine and cancer questions and worst on stem cells, athletic performance, and nutrition, areas where controversial claims, alternative practices, and pseudoscientific advice are especially widespread.

A separate problem concerns not only content but also presentation. The authors note that responses were almost always written in a confident tone and rarely accompanied by disclaimers or warnings. Out of 250 queries, chatbots refused to answer only twice, and both refusals came from Meta AI. Citation quality was weak: average bibliography completeness was around 40%, and no service produced a completely accurate source list, partly because of fabricated or distorted citations. The texts were also difficult for a general audience: their reading level suited a college graduate more than someone simply trying to quickly understand a symptom or prescription.

The authors emphasize that this is neither a verdict against using AI in medicine nor proof that chatbots are useless. The study has limitations: it covered only five models, tested them at a single point in time, and deliberately used some queries as a stress test to expose system vulnerabilities. The 50% figure for problematic responses therefore should not be mechanically applied to every everyday dialogue with AI.

But the more important conclusion is different: when a topic is controversial, emotionally charged, or already saturated with medical myths, a model easily produces convincing-sounding text without sufficient scientific backing. According to Gallup data from April 15, 2026, 25% of Americans have already used AI tools for medical information or advice, so this is not a niche habit but mass behavior.

The practical meaning of this research is fairly straightforward. A chatbot can be useful as a quick navigator: explaining a term, helping compile a list of questions for a doctor, or suggesting what else to clarify. But it should not replace diagnosis, clinical reasoning, or treatment selection.

For AI companies, this is a signal to strengthen guardrails, citation checking, and user warnings. For users, it is a reminder that a model's confident tone does not guarantee reliability. The higher the cost of error, the less room there is for the machine to improvise.
