
Why OpenAI, Google, and Anthropic models become more convincing but make mistakes more often


AI-processed from Habr AI; edited by Hamidun News

The main problem for the largest AI labs right now is not that their models sound unconvincing, but that confident delivery increasingly conceals deeper errors. OpenAI, Google, and Anthropic have spent the last two years trying to solve this by scaling compute at inference time: adding chains of reasoning, multi-path search, self-checking, and more tokens per request. From the outside, this looks like progress. But if the base model hasn't become more accurate, additional "thinking" only makes its misconceptions more coherent and plausible.

The idea of scaling inference seems logical: if models are given more time and more steps to solve a task, they should make fewer mistakes. In practice, this approach has indeed eliminated some surface-level failures: fewer absurd factual errors, fewer answers that fall apart on first review, fewer obvious failures in demos. That's why reasoning models make such a strong impression: they speak more consistently, structure answers better, and can mimic a careful analysis process. The problem is that a coherent text and an accurate answer are not the same thing.

The difference is especially clear when comparing simple and deep hallucinations. A simple hallucination is a fabricated date, a confused name, or a non-existent reference, and it can usually be caught quickly. A deep structural error is more dangerous: the model takes a false premise, builds a long logical chain on it, adds a confident tone, and delivers a complete, convincing answer. The user sees not chaos, but a neatly packaged falsehood. For tasks like analytics, document preparation, programming, medicine, or legal consultation, this type of error is far riskier than a typical random mistake.

Against this backdrop, the numbers look alarming. In a recent comparison of large OpenAI models on the SimpleQA benchmark, hallucination rates of around 50% were cited. If every other response to simple factual questions turns out to be false or fabricated, this is no longer a cosmetic defect but a systemic vulnerability.

Yes, any benchmark has limitations: much depends on wording, evaluation methodology, and the specific model version. But the trend itself is telling. Releases are becoming more eloquent and more computationally expensive, while fundamental reliability isn't growing at the same pace, and sometimes seems to be getting worse. For corporate scenarios, this is enough for errors to slip into presentations, reports, or code bases unnoticed.

The reason may lie in the approach itself. Additional compute at inference time doesn't create new knowledge or fix weaknesses in the training data. It only makes the model search longer for answers within its existing representation space. If the model's underlying worldview is distorted, a long chain of reasoning won't necessarily lead it to the truth. On the contrary, it can amplify the self-confirmation effect: the model may double-check the same incorrect hypothesis multiple times in different words, thereby making the error even more convincing.
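A toy calculation can make the self-confirmation effect concrete. The sketch below is not from the original article: it assumes a deliberately simplified setup in which the model answers a right-or-wrong question several times, each attempt is independently correct with probability `p`, and the final answer is the majority vote. Real reasoning chains are correlated rather than independent, so this is only an illustration of the direction of the effect. The function name `majority_correct_probability` is hypothetical.

```python
from math import comb


def majority_correct_probability(p: float, n: int) -> float:
    """Probability that a strict majority of n attempts is correct.

    Assumptions (illustrative only): attempts are independent, each is
    correct with probability p, and n is odd so ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    needed = n // 2 + 1  # smallest number of correct attempts that wins the vote
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


if __name__ == "__main__":
    # Compare a mostly-right base model (p = 0.7) with a mostly-wrong one (p = 0.4)
    # as the number of attempts grows.
    for p in (0.7, 0.4):
        for n in (1, 5, 25):
            print(f"p={p}, attempts={n}: {majority_correct_probability(p, n):.3f}")
```

Under these assumptions, extra attempts help only when the base model is right more often than not. When `p` is below 0.5, more attempts push the majority toward the wrong answer, which matches the article's point that extra compute cannot repair an inaccurate base model.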

A paradox emerges: more compute reduces the probability of a stupid mistake but increases the risk of a beautiful one. The more confident the system sounds, the lower the chance the user will stop in time and verify the foundation of the reasoning.

This points to a broader conclusion for the market. The threat to AI leaders may come not only from a new "super model," but also from teams that manage to build more reliable systems on top of models: with quality retrieval, source attribution, confidence calibration, strict fact-checking, and assessment of truthfulness, not just fluency. The winner won't be the one who generates the longest answer, but the one whose answer can be trusted in real work. If the industry continues to confuse persuasiveness with intelligence, the window of opportunity for new players is already open.
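One common way to measure the gap between how confident a system sounds and how often it is right is expected calibration error (ECE). The sketch below is an illustration rather than anything described in the source: it assumes each answer comes with a numeric confidence between 0 and 1 (how that number is obtained is left open) and a true/false correctness label from some external check. The function name and the choice of ten equal-width bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], one per answer (assumed to be available).
    correct:     booleans, True where the answer was verified as right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group answers into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Four answers delivered with 90% confidence, only half of them right.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False]))
```

A score near zero means stated confidence tracks actual accuracy. A large score for a fluent model is exactly the "beautiful mistake" pattern: answers that sound certain far more often than they are correct.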

