Why ChatGPT and other LLMs became far more powerful than simple "word prediction"

Q: What is the source?

Originally published on Habr AI. Hamidun News processes and adapts the material with AI.

Q: When was it published?

May 2, 2026. Reading time: 3 min.


Hamidun News Editorial


Illustration. Source: Habr AI. Collage: Hamidun News.


Large language models still build their responses by predicting the next token, but this seemingly simple mechanism turned out to be far more productive than even many researchers expected. The sharp growth in LLM quality is explained not only by scale but also by how self-critique, tools, and multi-step reasoning were layered on top of the base model.

Where the Skepticism Came From

As recently as 2024, a popular explanation went like this: LLMs are giant text autocompletes that do not understand meaning and merely continue token sequences. A direct conclusion followed: if the foundation is so primitive, the quality ceiling for such systems must be low. Hallucinations, formulaic responses, and poor performance on tasks requiring fresh data only reinforced this view.

A typical example is a question with specific real-world details, such as whether it is cheaper to fly from London to Barcelona or take a train next Friday. Early models answered in generalities: planes are usually faster and cheaper, trains are more comfortable and eco-friendly. Such an answer might sound plausible, but it did not help anyone make a decision.

That's why it seemed to many that scaling alone wasn't enough: what was needed wasn't a larger autocomplete, but a different level of behavior.

What Was Added on Top

The first important layer on top of the base model was the ability to recognize its own uncertainty. Instead of confidently fabricating an answer, modern LLMs can increasingly say that they do not have access to real-time data, lack context, or should consult an external source. This looks like a cosmetic improvement, but in practice it sharply increases usefulness: the model stops masking knowledge gaps and starts marking the boundaries of its competence correctly.

The second layer is tool calling. Architecturally, the model still generates tokens, but the environment now interprets certain tokens as commands: run a web search, call an API, query a database, or execute a small script. As a result, the LLM no longer has to store everything in its weights: it can fetch missing facts while generating a response and continue reasoning from them.

Check current prices, weather, or schedules via web search
Access corporate knowledge bases or external APIs
Run Python scripts for calculations and comparisons
Rerun queries if initial results appear outdated or contradictory
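The tool-calling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: `scripted_model` is a hypothetical stand-in for an LLM, the JSON command format is invented for this example, and the tools and prices are made up. The point is the division of labor: the "model" only emits text, and the surrounding loop decides whether that text is a command to execute or a final answer.

```python
import json

# Hypothetical tools with made-up data; a real system would call a search
# engine, an external API, or a sandboxed interpreter instead.
def lookup_price(route):
    prices = {"LON-BCN flight": "95", "LON-BCN train": "210"}
    return prices.get(route, "unknown")

def subtract(expression):
    # Supports only "a-b" so the sketch stays tiny and safe.
    left, right = expression.split("-")
    return str(float(left) - float(right))

TOOLS = {"lookup_price": lookup_price, "subtract": subtract}

def scripted_model(transcript):
    """Stand-in for an LLM: picks the next message from the transcript so far."""
    results = [m for m in transcript if m.startswith("TOOL_RESULT: ")]
    script = [
        {"tool": "lookup_price", "input": "LON-BCN flight"},
        {"tool": "lookup_price", "input": "LON-BCN train"},
        {"tool": "subtract", "input": "210-95"},
    ]
    if len(results) < len(script):
        return json.dumps(script[len(results)])
    return "The flight is cheaper by " + results[-1][len("TOOL_RESULT: "):]

def parse_tool_call(text):
    """Return the tool call encoded in text, or None if it is ordinary prose."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(call, dict) and call.get("tool") in TOOLS:
        return call
    return None

def run(question, model=scripted_model, max_steps=5):
    transcript = ["USER: " + question]
    for _ in range(max_steps):
        output = model(transcript)
        call = parse_tool_call(output)
        if call is None:
            return output  # no command found: treat it as the final answer
        result = TOOLS[call["tool"]](call["input"])
        transcript.append("TOOL_RESULT: " + result)  # feed the result back in
    return "step limit reached"

print(run("Is it cheaper to fly or take the train from London to Barcelona?"))
```

The step limit is a deliberate design choice: since the model can keep requesting tools, the environment rather than the model decides when to stop.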

Why This Worked

But the most unexpected leap in quality came not from tools alone but from training models to reason. At first this took the form of step-by-step prompting, which helped the model work through tasks more carefully. Then reinforcement learning entered the picture, followed by verifiable-reward approaches, in which the correctness of a math or code answer can be checked automatically. The model began not just to output answers but increasingly to choose trajectories that actually lead to correct solutions.

"Reinforcement learning is always aimed at an outcome. In this case, that outcome became reasoning."
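What "checked automatically" might look like can be shown with a minimal sketch. The function names `math_reward` and `code_reward` and the sample trajectories are hypothetical, chosen for illustration. Real training pipelines run generated code in a sandbox and feed the rewards into a reinforcement learning algorithm, which is omitted here.

```python
def math_reward(model_answer, reference):
    """Return 1.0 if the final numeric answer matches the reference, else 0.0."""
    try:
        return 1.0 if abs(float(model_answer) - float(reference)) < 1e-9 else 0.0
    except ValueError:
        return 0.0  # the answer is not a number at all

def code_reward(source, func_name, cases):
    """Return the fraction of test cases that the generated code passes."""
    namespace = {}
    try:
        exec(source, namespace)  # assumption: in practice this runs in a sandbox
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on this case counts as a failure
    return passed / len(cases)

# Three sampled reasoning trajectories for "What is 17 * 3?" (made-up data).
trajectories = [
    {"reasoning": "17 * 3 = 41", "answer": "41"},
    {"reasoning": "17 * 3 = 30 + 21 = 51", "answer": "51"},
    {"reasoning": "not sure", "answer": "n/a"},
]
rewards = [math_reward(t["answer"], "51") for t in trajectories]
print(rewards)  # [0.0, 1.0, 0.0]

good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"
cases = [((1, 2), 3), ((0, 0), 0)]
print(code_reward(good, "add", cases), code_reward(buggy, "add", cases))  # 1.0 0.5
```

Only the outcome is scored, not the wording of the reasoning, so trajectories that end in a correct answer are the ones that get reinforced.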

From this grew another idea: if the model already knows how to think step by step, it can be given more time to reason. Additional tokens generated during a response are not empty chatter but exploration of alternatives, self-checking, and backtracking from failed hypotheses. In effect, part of the system's intelligence is now determined not only by what was learned during training but also by how much computation it spends at query time.
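A back-of-the-envelope sketch shows why extra computation at query time can pay off. It uses majority voting over several independently sampled answers, which is one simple way to spend more compute and not necessarily what the models described here do internally. The assumptions are strong and purely illustrative: each sample is correct with the same probability `p`, samples are independent, and all wrong samples agree on the same wrong answer.

```python
from collections import Counter
from math import comb

def majority_vote(answers):
    """Return the most common final answer among sampled reasoning chains."""
    return Counter(answers).most_common(1)[0][0]

def p_majority_correct(p, n):
    """Probability that more than half of n independent samples are correct,
    when each sample is correct with probability p (use odd n to avoid ties)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

print(majority_vote(["51", "41", "51"]))  # 51

# More samples means more query-time compute and a more reliable final answer.
for n in (1, 3, 5):
    print(n, round(p_majority_correct(0.6, n), 5))
# 1 0.6
# 3 0.648
# 5 0.68256
```

The gain exists only when a single sample is right more often than wrong; with `p` below 0.5 under these assumptions, voting makes things worse, which is one reason self-checking and verifiers matter alongside raw sampling.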

It is precisely the combination of reasoning and tools that makes modern LLMs so much stronger than earlier versions. In the train-versus-plane task, a good model first works out what data it lacks, then searches for prices, compares connections and route durations, runs the numbers in code if needed, and finally double-checks that the results have not gone stale. This is no longer just a polished text response but a working decision-making loop built on top of the same next-token prediction mechanism.
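The last step of that loop, re-checking whether results have gone stale before deciding, can be sketched as follows. Everything here is a hypothetical illustration: the `Quote` structure, the 30-minute freshness window, the `fake_refetch` helper, and all prices and durations are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Quote:
    mode: str             # "flight" or "train"
    price: float          # made-up currency units
    hours: float          # door-to-door duration
    fetched_at: datetime  # when the price was looked up

def is_stale(quote, now, max_age=timedelta(minutes=30)):
    """A quote is stale if it was fetched longer ago than max_age."""
    return now - quote.fetched_at > max_age

def decide(quotes, now, refetch):
    """Refresh stale quotes, then pick the cheapest option (faster wins ties)."""
    fresh = [refetch(q.mode) if is_stale(q, now) else q for q in quotes]
    return min(fresh, key=lambda q: (q.price, q.hours))

now = datetime(2026, 5, 2, 12, 0)
cached = [
    Quote("flight", 95.0, 2.2, now - timedelta(hours=3)),     # stale
    Quote("train", 140.0, 10.5, now - timedelta(minutes=5)),  # still fresh
]

def fake_refetch(mode):
    # Stand-in for a new search: the flight price has risen since it was cached.
    return Quote(mode, 160.0, 2.2, now)

best = decide(cached, now, fake_refetch)
print(best.mode, best.price)  # train 140.0
```

Without the freshness check, the cached 95.0 flight would have won, which is exactly the kind of confident but outdated answer the loop is meant to avoid.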

What This Means

The success of LLMs is now explained neither by magic nor by scaling alone, but by engineering layered on top of a basic principle. Models can still make errors, get stuck in loops, and hallucinate, but the combination of self-critique, tools, and RL-trained reasoning has turned "text autocomplete" into a system that genuinely helps solve practical problems.

