
BPE vs. Morphemes: Why Your AI Still Can't Read

BPE, the gold standard of tokenization used by GPT and Claude, is hopelessly outdated. It splits words by character frequency, turning meaningful terms into mush.

AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

Imagine learning a foreign language from a textbook that has been cut into random scraps of paper. Instead of learning the Russian root "ход" (roughly "going, movement") and understanding dozens of words from "выхода" ("of an exit") to "перехода" ("of a crossing"), you're forced to memorize each letter combination as a unique hieroglyph. This is how the world's most advanced language models see text today.

While we marvel at the capabilities of GPT-4 or Claude 3, their foundation contains an architectural time bomb called BPE, or Byte Pair Encoding. This algorithm became the industry standard back in 2016, and almost no one has questioned it since. The problem is that BPE is a sociopathic mathematician who couldn't care less about linguistics.

It cuts text into tokens based solely on how often adjacent symbols occur together, merging the most frequent pairs first. As a result, a word like "paratrooper" can reach the model as a meaningless sequence of "par," "atro," and "oper." The model wastes billions of computational cycles and vast swaths of its parameters simply reconstructing the logical connections between these scraps that your brain reads instantly.
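To make the mechanism concrete, here is a minimal sketch of classic BPE training on a toy corpus. The corpus, the function names, and the number of merges are illustrative assumptions, and production tokenizers add pre-tokenization, byte-level handling, and many optimizations that are left out here.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} corpus."""
    # Start from single characters; frequency is the only signal used below.
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Toy corpus (an assumption for illustration): word -> frequency.
corpus = {"walked": 5, "talked": 4, "walking": 3}
merges, words = train_bpe(corpus, num_merges=3)
print(merges)
for symbols, freq in words.items():
    print(freq, " ".join(symbols))
```

Nothing in this procedure refers to roots, prefixes, or suffixes: a pair that straddles "walk" and "ed" competes on exactly the same terms as a pair inside a single morpheme.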

For a long time, it was believed that if you feed a neural network enough data, it will learn grammar and morphology rules on its own. And it does learn, but does so extremely inefficiently. A group of researchers decided to test what would happen if they reintroduced common sense and linguistic structure into the training process.

They tested the MorphBPE and MorphPiece approaches, which force the tokenizer to respect morpheme boundaries: prefixes, roots, and suffixes. The results were sobering for the "pure mathematics" devotees. Models using morphological tokenization show a 25% accuracy improvement on the LAMBADA test, which measures the ability to predict the final word of a passage.
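One simple way to "respect morpheme boundaries" is to split each word into morphemes first and let BPE merges happen only inside a morpheme. The sketch below illustrates that idea under stated assumptions: the hand-written `SEGMENTED` table and the function name `train_morph_bpe` are hypothetical, and this is not presented as the actual MorphBPE or MorphPiece algorithm.

```python
from collections import Counter

# Hypothetical hand-written segmentations: word -> (morphemes, frequency).
# A real system would obtain these from a morphological analyzer or lexicon.
SEGMENTED = {
    "walked": (["walk", "ed"], 5),
    "talked": (["talk", "ed"], 4),
    "walking": (["walk", "ing"], 3),
}

def train_morph_bpe(segmented, num_merges):
    """BPE-style merging where each morpheme is a separate unit."""
    # Because morphemes are stored separately, no adjacent pair can ever
    # span a morpheme boundary, so no merge can cross one.
    units = Counter()
    for morphemes, freq in segmented.values():
        for morpheme in morphemes:
            units[tuple(morpheme)] += freq

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in units.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every morpheme is already a single token
        best = max(pairs, key=pairs.get)
        merges.append(best)

        new_units = Counter()
        for symbols, freq in units.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_units[tuple(out)] += freq
        units = new_units
    return merges, units

merges, units = train_morph_bpe(SEGMENTED, num_merges=10)
print(merges)
print(sorted("".join(symbols) for symbols in units))
```

With enough merges, every learned token in this toy setup is a whole morpheme or a fragment of one, so a suffix like "ed" is shared across words instead of being glued to part of a root.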

But even more important is convergence speed: such networks train twice as fast. In one experiment, a model that completed just 200,000 training steps with proper tokenization matched the quality of answers from GPT-2 Large, which is six times larger in terms of parameters. This is a direct signal to the market: we can get the same results on much cheaper hardware if we stop feeding algorithms "word mush."

Why haven't OpenAI, Google, and Anthropic switched to this method yet? The answer lies in inertia and the complexity of implementation for multilingual systems. BPE is universal: it doesn't matter whether you feed it English text, Python code, or Chinese characters.
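Part of that universality comes from byte-level variants of BPE, such as the one used by GPT-2, which start from raw UTF-8 bytes rather than characters. The short sketch below shows only that starting point, with a hypothetical helper name `to_byte_symbols`: any input, in any script, reduces to symbols from the same 256-value alphabet, and no language-specific resources are needed.

```python
def to_byte_symbols(text):
    """Return the UTF-8 bytes of `text` as a list of integers in 0..255."""
    return list(text.encode("utf-8"))

samples = ["reader", "def f(x): return x", "выход", "分词"]
for sample in samples:
    ids = to_byte_symbols(sample)
    # The byte sequence is lossless: decoding it restores the original text.
    assert bytes(ids).decode("utf-8") == sample
    print(f"{sample!r}: {len(sample)} characters -> {len(ids)} byte symbols")
```

The trade-off is visible in the counts: non-Latin scripts need several bytes per character, so they start from longer sequences before any merges are applied.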

Morphological analysis requires customization for each specific language, which complicates the data preparation pipeline. However, the current crisis in training costs and the shortage of quality texts are pushing engineers to seek new optimization paths. When the cost of training a flagship model exceeds hundreds of millions of dollars, saving 50% of convergence time becomes a matter of business survival.

Additionally, the morphological approach solves the problem of rare words and neologisms. If a model understands morpheme meanings, it can logically deduce the meaning of a word it sees for the first time, instead of guessing blindly based on token combinations.
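As a rough illustration of how an unseen word can still decompose into familiar parts, here is a greedy longest-match segmenter over a tiny lexicon. The `MORPHEMES` set and the `segment` function are hypothetical, and greedy matching is a simplification: real morphological analyzers handle ambiguity, spelling changes, and language-specific rules.

```python
# Hypothetical toy lexicon of known morphemes.
MORPHEMES = {"re", "un", "walk", "talk", "able", "ed", "ing"}

def segment(word, lexicon=MORPHEMES):
    """Split `word` left to right, always taking the longest known morpheme."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first, then progressively shorter ones.
        for j in range(len(word), i, -1):
            if word[i:j] in lexicon:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known morpheme starts here: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("rewalkable"))  # ['re', 'walk', 'able']
print(segment("untalked"))    # ['un', 'talk', 'ed']
```

Neither word needs to appear in the training data: as long as the pieces are known, the model receives tokens it has already seen in other contexts.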

We are now witnessing a quiet comeback of classical linguistics in the era of neural network hype. It's becoming clear that endless brute-force scaling by increasing the number of GPUs is a dead-end path. The future lies in hybrid solutions, where a deep understanding of language structure at the input frees the model from reinventing the wheel within its layers.

Most likely, in the architecture of GPT-5 or its successors, we will see an abandonment of primitive BPE in favor of more intelligent text-splitting systems. This is not just a technical detail, but a fundamental shift in how machines perceive human culture encoded in words. While researchers refine MorphPiece and Unigram algorithms with morphological enhancements, developers should prepare for the fact that old dataset preparation methods will soon be consigned to the dustbin of history.

The bottom line: splitting text along morphemes instead of arbitrary fragments makes models smarter and, judging by the convergence results above, roughly half as expensive to train. Is the industry ready to admit that linguists were right from the beginning, or will we continue burning electricity trying to teach AI to read syllable by syllable?

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
