OpenAI's AI outperformed doctors in diagnosis — but scientists urge caution
An OpenAI LLM got the diagnosis right in 82% of cases from real emergency-care histories — more often than the doctors (79% and 70%). But researchers warn there is still no agreed standard for evaluating such results.

For the first time, a language model from OpenAI has surpassed physicians in diagnostic accuracy on real emergency-care data. The study was published in the journal Science on April 30.
What the Study Showed
The o1-preview model from OpenAI analyzed the medical histories of 76 real emergency-department cases. At different stages of treatment—upon admission, after physician examination, after transfer to another department—the model made diagnoses in parallel with two physicians. And it was right more often: at the final stage, 82% correct diagnoses versus 79% and 70% for the physicians. Notably, both the humans and the model did better as more information became available, but the AI kept its advantage at every stage, even with incomplete data.
- 82% diagnostic accuracy versus 79% and 70% for physicians
- Tested on real emergency care histories
- At each stage, the model worked from the same case details as the physicians
- Improved results with each new piece of information
But Doctors Are Cautious
The study authors themselves are quick to clarify that AI is not replacing doctors. "I don't think our results mean that AI will displace physicians," says co-author Arjun Manrai of Harvard Medical School. His colleague Adam Rodman, a medical instructor in Boston, adds: "The results are cool, don't get me wrong, but I'm slightly concerned about how they might be used."

The main issue is that there is no unified standard for evaluating LLMs on medical tasks. Some researchers count it a success if a model identifies 5 of 7 possible diagnoses; others treat the same output as a complete failure. One and the same result gets opposite grades, as the sketch below illustrates.
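To make the disagreement concrete, here is a minimal, purely hypothetical sketch: the diagnoses, list lengths, and scoring thresholds are invented for illustration and are not taken from the study. It scores one and the same model output under two plausible but different rules.

```python
# Hypothetical example: the same model output scored under two different
# evaluation rules for a differential-diagnosis task. Nothing here comes
# from the Science study; the lists and thresholds are illustrative.

reference = {"sepsis", "pneumonia", "pulmonary embolism",
             "heart failure", "COPD exacerbation", "pericarditis", "asthma"}

model_differential = ["pneumonia", "sepsis", "heart failure",
                      "pulmonary embolism", "COPD exacerbation"]

hits = sum(d in reference for d in model_differential)  # 5 of 7 recovered

# Rule A: "success" if the model recovers most of the reference diagnoses.
rule_a_success = hits / len(reference) >= 0.7   # 5/7 ≈ 0.71 -> success

# Rule B: "success" only if every reference diagnosis is recovered.
rule_b_success = hits == len(reference)         # 5 != 7 -> failure

print(f"Recovered {hits}/{len(reference)} diagnoses")
print(f"Rule A (recall >= 70%): {'success' if rule_a_success else 'failure'}")
print(f"Rule B (full coverage): {'success' if rule_b_success else 'failure'}")
```

The same output is a success under one rule and a failure under the other, which is exactly why results from different evaluations are hard to compare.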
The Problem with Chatbot Reliability
Parallel research shows that chatbots often give false answers to medical questions. Nearly half of the responses contained errors: fabricated sources, inaccurate advice, falsehoods delivered with confidence. The model looks equally convincing whether it is right or wrong.
"These models are used every day, and there is a certain risk that no one measures or mitigates," —
Arya Rao, Harvard
This puts the physician in a difficult position: when the model offers a consultation, the doctor must quickly judge whether it is correct or a hallucination. A physician is, of course, better placed to know which information matters. But spotting a falsehood inside a convincing answer is hard for anyone.
What This Means
OpenAI has already launched ChatGPT for doctors and healthcare. The technology is moving faster than medicine can regulate and test it. What is needed are real clinical trials and clear workflows in which the physician uses AI as an assistant during consultations, not as the final answer. Speed of innovation matters, but responsibility matters more.