OpenAI's AI outperformed doctors in diagnosis — but scientists urge caution
An OpenAI LLM got the diagnosis right in 82% of cases from real emergency-care histories — more often than the doctors (79% and 70%). But researchers warn there is still no agreed standard for evaluating such results.

For the first time, a language model from OpenAI has surpassed physicians in diagnostic accuracy on real emergency-care data. The study was published in the journal Science on April 30.
What the Study Showed
The o1-preview model from OpenAI analyzed the medical histories of 76 real emergency-department cases. At different stages of treatment—upon admission, after physician examination, after transfer to another department—the model made diagnoses in parallel with two physicians. And it was right more often: at the final stage, 82% correct diagnoses versus 79% and 70% for the physicians. Notably, both the humans and the model did better as more information became available, but the AI kept its advantage at every stage, even with incomplete data.
- 82% diagnostic accuracy versus 79% and 70% for physicians
- Tested on real emergency care histories
- At each stage, the model worked from the same case details as the physicians
- Improved results with each new piece of information
But Doctors Are Cautious
The study authors themselves are quick to clarify that AI is not replacing doctors. "I don't think our results mean that AI will displace physicians," says co-author Arjun Manrai of Harvard Medical School. His colleague Adam Rodman, a medical instructor in Boston, adds: "The results are cool, don't get me wrong, but I'm slightly concerned about how they might be used."

The main issue is that there is no unified standard for evaluating LLMs on medical tasks. Some researchers count it a success if a model identifies 5 of 7 possible diagnoses; others treat the same output as a complete failure. One and the same result gets opposite grades, as the sketch below illustrates.
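To make the disagreement concrete, here is a minimal, purely hypothetical sketch: the diagnoses, list lengths, and scoring thresholds are invented for illustration and are not taken from the study. It scores one and the same model output under two plausible but different rules.

```python
# Hypothetical example: the same model output scored under two different
# evaluation rules for a differential-diagnosis task. Nothing here comes
# from the Science study; the lists and thresholds are illustrative.

reference = {"sepsis", "pneumonia", "pulmonary embolism",
             "heart failure", "COPD exacerbation", "pericarditis", "asthma"}

model_differential = ["pneumonia", "sepsis", "heart failure",
                      "pulmonary embolism", "COPD exacerbation"]

hits = sum(d in reference for d in model_differential)  # 5 of 7 recovered

# Rule A: "success" if the model recovers most of the reference diagnoses.
rule_a_success = hits / len(reference) >= 0.7   # 5/7 ≈ 0.71 -> success

# Rule B: "success" only if every reference diagnosis is recovered.
rule_b_success = hits == len(reference)         # 5 != 7 -> failure

print(f"Recovered {hits}/{len(reference)} diagnoses")
print(f"Rule A (recall >= 70%): {'success' if rule_a_success else 'failure'}")
print(f"Rule B (full coverage): {'success' if rule_b_success else 'failure'}")
```

The same output is a success under one rule and a failure under the other, which is exactly why results from different evaluations are hard to compare.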
The Problem with Chatbot Reliability
Parallel research shows that chatbots often give false answers to medical questions. Nearly half of the responses contained errors: fabricated sources, inaccurate advice, falsehoods delivered with confidence. The model looks equally convincing whether it is right or wrong.
"These models are used every day, and there is a certain risk that no one measures or mitigates," —
Arya Rao, Harvard
This puts the physician in a difficult position: when the model offers a consultation, the doctor must quickly judge whether it is correct or a hallucination. A physician is, of course, better placed to know which information matters. But spotting a falsehood inside a convincing answer is hard for anyone.
What This Means
OpenAI has already launched ChatGPT for doctors and healthcare. The technology is moving faster than medicine can regulate and test it. What is needed are real clinical trials and clear workflows in which the physician uses AI as an assistant during consultations, not as the final answer. Speed of innovation matters, but responsibility matters more.