Popular AI chatbots get diagnoses wrong in more than 80% of cases, study finds
AI-processed from 3DNews AI; edited by Hamidun News
Popular consumer AI chatbots perform poorly in the role of digital diagnosticians. Research has shown that when attempting to make medical diagnoses based on a limited set of symptoms, they err in more than 80% of cases.
How the bots were tested
The researchers tested not specialized medical systems, but mass-market chatbots that users turn to for quick answers to any question. They were given scenarios with incomplete patient data — roughly how a person describes their condition in their first message, without tests, physical examination, or a doctor's follow-up questions. The task seemed simple: match symptoms with a probable cause. But this is precisely where the main problem revealed itself: a confident, coherent, and conveniently formatted response often did not align with clinically correct conclusions.
It also matters that this format closely mirrors real user behavior. People rarely come to a bot with neatly organized medical records; more often they bring just a few phrases about pain, fever, weakness, or an unusual symptom they want explained quickly without scheduling a clinic visit. So the test essentially checked not the model's abstract ability to reason about medicine, but its suitability for an everyday scenario where there is a temptation to replace a specialist consultation with an instant chat answer.
Where the errors come from
General-purpose models are good at summarizing general information, explaining terminology, and condensing material into a compact answer. But medical diagnosis works differently: it requires handling uncertainty, ruling out similar possibilities, accounting for rare but dangerous scenarios, and sometimes honestly saying that there is insufficient data.
A consumer bot tends to complete a request with a helpful-seeming answer rather than stop at a cautious formulation and refer the person to a specialist. An additional problem is that a mass-market chatbot does not conduct a proper diagnostic dialogue. It may ask a couple of clarifying questions, but it does not test hypotheses systematically, does not check answers against objective measurements, and bears no clinical responsibility for the outcome.
Even if the model guesses the general direction, it easily misses details that for a doctor change the entire conclusion: symptom duration, comorbidities, medications, age, recent surgeries, or the pattern of deterioration. The problem is compounded when symptoms overlap across dozens of conditions and the user describes them imprecisely or too briefly.
Under these conditions, the model begins to fill in the picture by template and compresses uncertainty into one confident answer, whereas in real practice a doctor would likely keep several hypotheses open and order additional tests. This is what makes the error particularly hard for the user to notice.
As a result, the bot tends to fail in typical ways:
- confuses conditions with similar symptoms
- downplays the urgency of potentially dangerous cases
- gives one confident diagnosis where a list of possibilities is needed
- fails to separate reference information from medical decision-making
Why this is dangerous
The main risk is not that the bot sometimes makes mistakes, but that it makes them convincingly. To a user, a calm and confident tone can look like a sign of competence, even though the answer rests on no physical examination, no access to medical history, and no lab results.
If a person receives false reassurance, they may postpone a doctor's visit, miss a deterioration in their condition, or take the wrong actions in the first hours, when a fast response matters most. This scenario is especially dangerous where symptoms resemble something harmless but actually require urgent evaluation: for example, severe pain, shortness of breath, neurological symptoms, or signs of infection. In such cases, an error is not just an imprecise chat answer, but lost time.
Consumer bots are optimized for conversational comfort and a sense of usefulness, not for conservative medical triage, where it is better to refer someone to a doctor one time too many than to miss a critical signal. This does not mean AI is useless in medicine. Such systems can help formulate complaints, explain terminology, gather questions for an appointment, or remind people what information to prepare before a consultation. But as a tool for making diagnoses, mass-market chatbots are not yet reliable, especially when information is scarce, symptoms are vague, and the cost of error is high.
In this role, it is more sensible to use them as a preparatory and reference layer before seeing a doctor, rather than as a final arbiter.
What this means
The study's conclusion is stark: popular AI bots should not be treated as a replacement for a doctor, even if they find information quickly and speak with confidence. For users, this marks the limit of how far such answers can be trusted; for companies, it is a signal that medical scenarios require specialized tuning, expert verification, and very careful presentation of answers.