
Why GPT gets letter counts wrong: the secret of tokenization

Source: Habr AI. Collage: Hamidun News.

When you write a prompt in ChatGPT, you see ordinary text made of letters. The model sees something entirely different: a sequence of numbers, each standing for a token, a chunk of text that can be a whole word or a fragment of one. This mismatch produces strange effects: GPT can miscount the letters in the word strawberry or stumble on simple counting tasks.

How LLMs See Text

For a neural network, text is not a set of letters but a sequence of numerical codes. Each token corresponds to a number, and the model works only with those numbers; it never sees the individual letters. It is as if you read a book through a system that first translates the words into codes, lets you work only with the codes, and then translates your codes back into text. A frequent word can be a single token, while a rare word is split into several pieces. For example, "hello" is encoded as a single number, while "strawberry" might become three or four. In Russian the situation is even worse: because of its rich morphology, words are broken down less efficiently.
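To make the "text becomes numbers" idea concrete, here is a minimal sketch of a toy tokenizer. Everything in it is an assumption made for illustration: the vocabulary `TOY_VOCAB`, its IDs, and the function `toy_tokenize` are invented, and the greedy longest-match rule is a simplification of what a real GPT tokenizer does.

```python
# Toy vocabulary: token text -> numeric ID. The entries and IDs are invented
# for illustration; a real GPT vocabulary is far larger and looks different.
TOY_VOCAB = {
    "hello": 1001,
    "str": 1002,
    "aw": 1003,
    "berry": 1004,
}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into token IDs using greedy longest-match lookup.

    A character that matches no vocabulary entry falls back to its Unicode
    code point, standing in for a single-character token.
    """
    ids = []
    pos = 0
    max_len = max(len(token) for token in vocab)
    while pos < len(text):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += length
                break
        else:
            # No vocabulary entry starts here: emit one character.
            ids.append(ord(text[pos]))
            pos += 1
    return ids


print(toy_tokenize("hello"))       # [1001]
print(toy_tokenize("strawberry"))  # [1002, 1003, 1004]
```

In this toy setup the model would receive only `[1002, 1003, 1004]` for "strawberry". Nothing in those three numbers says how many times the letter "r" occurs, which is the root of the counting problem.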

Why This Creates Problems

The discrepancy between how people see text and how the model sees it leads to a whole range of problems:

  • Counting errors: GPT cannot simply count letters, because it works with token numbers rather than characters
  • Fragmented rare words: rare letter combinations are split into multiple tokens, so the model sees them as separate pieces rather than as one word
  • Language asymmetry: English breaks down into tokens more efficiently than Russian, Chinese, or Arabic
  • Context consumption: if a word is split into 3 tokens instead of 1, your prompt takes up more space in the context window
  • Unpredictable behavior: the model can behave strangely with numbers, codes, and rare names because they are split into fragments
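One contributing factor to the language asymmetry can be checked without any tokenizer at all. The sketch below assumes a byte-level variant of BPE, the kind used by GPT-family tokenizers, where the starting units are UTF-8 bytes rather than letters. It only counts raw bytes, not real tokens, and the helper name `utf8_units` is made up for this example.

```python
def utf8_units(text):
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


# Short greetings in three languages; the word choice is arbitrary.
samples = {"English": "hello", "Russian": "привет", "Chinese": "你好"}

for language, word in samples.items():
    chars, raw_bytes = utf8_units(word)
    print(f"{language}: {chars} characters, {raw_bytes} bytes before any merges")
```

Latin letters take one byte each, Cyrillic letters two, and common Chinese characters three. Non-English text therefore starts out with more units per character, and it needs more learned merges to reach the same compactness.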

The Algorithm Behind the Scenes

Behind tokenization lies the Byte Pair Encoding (BPE) algorithm. Here is how it works. First, the text is split so that each character is a separate token. Then the algorithm finds the pair of adjacent tokens that occurs most frequently and merges it into a new token. After that, it looks for the most frequent pair among the resulting tokens and merges again. This repeats tens of thousands of times. As a result, the vocabulary of GPT-2 and GPT-3 contains about 50 thousand tokens, and newer GPT models use larger vocabularies. Frequent words and word parts become single tokens, while rare letter combinations remain fragmented. It is not ideal, but it is more efficient than encoding each letter separately.
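The merge loop described above can be sketched in a few lines. This is a toy, character-level version written for illustration, not the training code of any real tokenizer. The function name `train_bpe` and the sample text are assumptions. The sketch also merges across spaces, whereas production tokenizers typically pre-split the text into words first.

```python
from collections import Counter


def train_bpe(text, num_merges):
    """Learn up to num_merges merge rules from text (toy character-level BPE)."""
    tokens = list(text)  # step 1: every character is its own token
    merges = []
    for _ in range(num_merges):
        # Step 2: count how often each adjacent pair of tokens occurs.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges gain nothing
        merges.append((left, right))
        # Step 3: replace every occurrence of the pair with one merged token.
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


merges, tokens = train_bpe("low lower lowest", num_merges=3)
print(merges)
print(tokens)
```

On this tiny input the first two merges build "lo" and then "low", so the frequent stem becomes a single token while the rarer endings stay as separate characters. That is the same pattern the article describes at the scale of a full vocabulary.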

What This Means

Understanding tokenization changes how you approach working with LLMs. If you know the model is likely to stumble on letter counting, you can ask it to work differently: for example, to print the letters one by one first and then count them. It is not a panacea, but it helps you write more reliable prompts. Knowing how tokenization works also helps when optimizing long prompts, because you can predict where the model will "spend" tokens unnecessarily. This matters for anyone working with LLMs at a deep level, from prompt engineers to developers building applications on top of neural networks.
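The "spell first, then count" advice can be wrapped in a small helper. The sketch below is one possible phrasing, not a tested recipe: the function names `build_counting_prompt` and `reference_count` and the prompt wording are assumptions. The reference count is computed in ordinary code so the model's answer can be checked against it.

```python
def build_counting_prompt(word, letter):
    """Build a prompt that asks the model to spell the word before counting."""
    return (
        f"First write out the word '{word}' one letter per line.\n"
        f"Then go through that list and count how many times "
        f"the letter '{letter}' appears."
    )


def reference_count(word, letter):
    """Ground-truth count computed in code, for checking the model's answer."""
    return word.lower().count(letter.lower())


print(build_counting_prompt("strawberry", "r"))
print(reference_count("strawberry", "r"))  # 3
```

When an exact count matters, doing the counting in code and using the model only for the surrounding task is usually the more dependable choice.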

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.