ChatGPT Nailed the Diagnosis in Five Cases, But Failed on Treatment Planning
ChatGPT passed five out of five cases on primary diagnosis in the experiment, including MGUS and statin-induced rhabdomyolysis. Yet divergences emerged in…
AI-processed from Habr AI; edited by Hamidun News
The authors ran the experiment expecting ChatGPT to make at least one diagnostic error, and the result surprised them: the model correctly identified the primary diagnosis in all five medical cases. But getting the diagnosis right did not translate into winning the overall clinical comparison. The most notable gap appeared one step later, in the practical action plan that follows the diagnosis: which examinations are needed before therapy begins, which specialists the patient should be referred to, which target values to monitor, and when to repeat tests.
It was precisely at this stage that ChatGPT more often lost to the specialized MedAssist service. The comparison included five cases: metabolic syndrome, subclinical hypothyroidism, perimenopause, MGUS, and statin-induced rhabdomyolysis. In all five, ChatGPT named the correct primary diagnosis, which is noteworthy in itself for a general-purpose LLM.
The authors acknowledge that before running the test they expected at least one serious error, but this did not happen. However, in medicine, diagnosis itself is only part of the task. The next step is no less important: is it safe to start treatment, which red flags to check in advance, and which clarifying tests are needed to avoid missing contraindications or associated risks.
This is where the difference between the models became systematic. In four routine cases, ChatGPT gave a weaker answer to the question of what the patient should do in the next two weeks. The issue was not polished wording but applied clinical logic: for example, a reminder to check PSA before testosterone replacement therapy, to do a mammography before prescribing menopausal hormone therapy, and to state target values and timelines for retesting.
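The kind of pre-therapy reminder described above can be sketched as a simple lookup. This is only an illustration: the table holds just the two pairings named in the article, the names `PRE_THERAPY_CHECKS` and `missing_checks` are hypothetical, and nothing here reflects how MedAssist or ChatGPT actually works or what a complete clinical checklist would contain.

```python
# Illustrative only: the two pairings mentioned in the article, not a clinical reference.
PRE_THERAPY_CHECKS = {
    "testosterone replacement therapy": ["PSA"],
    "menopausal hormone therapy": ["mammography"],
}


def missing_checks(planned_therapy: str, completed: set[str]) -> list[str]:
    """Return the checks from the table that have not been completed yet."""
    required = PRE_THERAPY_CHECKS.get(planned_therapy, [])
    return [check for check in required if check not in completed]


# A plan that names the therapy but has no PSA result on file gets flagged.
print(missing_checks("testosterone replacement therapy", completed=set()))
```

A response that names the right diagnosis but returns a non-empty list here is the situation the authors describe: correct on the first line, incomplete on the next step.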
In the rhabdomyolysis case, interpretation of the AST to ALT ratio also proved important: this detail affects how the abnormal test results are explained and how the patient is managed afterward. But the comparison also produced a counterexample. In the case of MGUS, monoclonal gammopathy of undetermined significance, it was MedAssist that turned out to be weaker.
ChatGPT explicitly calculated the albumin to globulin ratio and separately listed the confirmatory studies that the patient should bring to a hematologist. The authors write directly that their service did neither, which is why their analysis of this case turned out to be the most detailed. The episode matters not only as a single loss but as a reminder: a specialized product does not gain an advantage automatically just because it was built for a narrow task.
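For readers unfamiliar with the calculation, a minimal sketch of the albumin to globulin ratio follows. It assumes the common convention that globulin is estimated as total protein minus albumin; the article does not say which inputs ChatGPT used, and the numbers below are hypothetical, not taken from the MGUS case.

```python
def albumin_globulin_ratio(total_protein_g_l: float, albumin_g_l: float) -> float:
    """Return albumin / globulin, estimating globulin as total protein minus albumin."""
    globulin_g_l = total_protein_g_l - albumin_g_l
    if globulin_g_l <= 0:
        raise ValueError("albumin must be lower than total protein")
    return albumin_g_l / globulin_g_l


# Hypothetical values for illustration only.
ratio = albumin_globulin_ratio(total_protein_g_l=80.0, albumin_g_l=40.0)
print(round(ratio, 2))  # 40 / (80 - 40) = 1.0
```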
The authors separately note a possible conflict of interest: the text was prepared by the team that makes MedAssist, one of the two services being compared. They do not try to hide this and argue that they fixed the methodology in advance, published the responses of both services verbatim, and analyzed their own losing case in detail rather than in passing. This does not eliminate questions about complete neutrality, but it makes the material more useful than typical marketing demonstrations where only convenient examples are shown.
For the reader, the most valuable part is not the case-by-case score but the transparency about where exactly the models are strong and where they start to slip in applied decisions. The main conclusion from this test is straightforward: in these five cases, a large language model consistently named the correct diagnosis, even in complex ones, but the quality of a medical response cannot be judged by its first line alone. If the system correctly names the condition but does not suggest mandatory examinations before therapy, does not outline the route to the right specialist, and does not specify follow-up timelines, the risk of error does not disappear.
For developers of medical AI services, this is a signal to shift focus from impressive diagnoses to the full patient management scenario. For users, it is a reminder that the value of such systems is determined not only by diagnostic accuracy but also by the safety of the next step.