Harvard: AI more accurate than doctors at emergency room triage

Harvard and Beth Israel Deaconess compared OpenAI o1 with doctors on real cases from the emergency department. At the initial triage stage, the model more often produced an exact or close diagnosis — 67% versus 50–55% for humans. But this is not about replacing physicians yet: the study assessed text-based clinical reasoning, not the full examination, patient interaction, or bedside decisions.

Khamidun Zhemal

AI monitoring · Guardian

Apr 30, 2026· 3 min

AI-processed from Guardian; edited by Hamidun News



A team from Harvard Medical School and Beth Israel Deaconess Medical Center reported that OpenAI's o1 reasoning model demonstrated higher accuracy than physicians in a range of emergency diagnostic tasks. The most notable result was at the stage of initial triage in the emergency department, where data is limited but decisions must be made quickly.

How the Comparison Was Conducted

The study was published on April 30, 2026 in the journal Science and is one of the largest attempts to compare AI not against exam-style tests but against real clinical work. The authors ran the model through six experiments, ranging from complex diagnostic cases and probabilistic reasoning to tasks that involved choosing the next clinical steps. The key part of the work involved 76 real cases from the emergency department of a hospital in Boston.

The model and physicians received identical records from electronic medical charts and were asked to propose the most likely diagnoses and next steps. Importantly, the data was barely "cleaned" before the test. Researchers used the same noisy and incomplete text that a physician sees in the first minutes: vital signs, age, brief description of complaints from a nurse, individual notes from the medical history.

Answers were assessed at three stages: at triage, at first contact with a physician, and at the point of deciding whether to admit the patient to a ward or an intensive care unit. Evaluators did not know whether an answer came from a human or the model.

Where AI Proved Stronger

The model showed its clearest advantage precisely where the physician had the least information. At early triage, OpenAI o1 gave an accurate or very close diagnosis in 67% of cases. Physicians assessing the same set of patients scored in the range of 50–55%. When more data became available, AI accuracy rose to 82%, while humans achieved 70–79%; here the gap was no longer statistically significant, but the trend persisted. In case management planning tasks, including the selection of tests and antibiotics and the discussion of treatment goals, the model also performed significantly better.

67% — accurate or close diagnosis by AI on initial triage
50–55% — physician results at the same stage
82% — AI accuracy after additional data arrived
89% — model performance in case management tasks versus 34% for physicians

The authors gave a telling example. In one case, a patient arrived with a blood clot in the lungs and a deteriorating condition. Physicians assumed that standard anticoagulant therapy had failed. The model, however, linked the clinical picture to lupus in the patient's history and hypothesized that the source of the problem was lupus-related lung inflammation. This hypothesis was later confirmed. The researchers specifically noted that the model handled rare and complex cases with confidence.

Why This Is Not a Replacement for Physicians

These results do not mean that the emergency department can be switched to autopilot. The study primarily tested the textual component of clinical reasoning: reading medical charts, building a differential diagnosis, and suggesting the next step. The AI did not examine the patient, did not see expressions of pain, did not listen to breathing, did not assess gait, and did not work with X-rays and EKGs the way a physician does at the bedside. Outside experts have already pointed out that this is closer to a "blind second opinion" based on text than to full real-time patient management.

"We are observing a truly profound technological shift that will change medicine," said study co-author Arjun Manrai.

At the same time, the authors themselves emphasize the limitations. Even if the model identifies the main diagnosis correctly more often, it can suggest unnecessary tests or interventions that could harm the patient. Furthermore, there is currently no clear system of accountability: who is responsible for an error if a physician relies on an algorithm's suggestion? The researchers therefore speak not of replacing the physician but of a new format of collaborative work, in which AI serves as a fast analyst and a source of a second opinion while the final decision remains with the human.

What This Means

For medicine, this is a signal that large language models are leaving the demonstration phase and approaching real clinical validation. The near-term scenario is not an autonomous "AI doctor" without people, but systems that seamlessly review electronic charts, flag missed diagnostic possibilities, and help prioritize emergency department cases more quickly. The next phase is now clear: not new benchmarks but prospective clinical trials that measure not only the accuracy of answers but also safety, cost, and impact on treatment outcomes.
