Tokenization
Tokenization is the process of splitting raw text into discrete units called tokens — typically subword fragments — that a language model numerically encodes and processes. A token averages roughly 4 characters in English; most modern LLMs use subword vocabularies of 32,000–200,000 entries.
Tokenization is the preprocessing step that converts a string of text into a sequence of integer IDs drawn from a fixed vocabulary. Each ID corresponds to a token, which may be a word, a subword fragment, a single character, or whitespace, depending on the tokenizer design. Vocabulary sizes of common modern tokenizers range from about 32,000 (LLaMA 2's SentencePiece tokenizer) through roughly 100,000 (the cl100k_base tiktoken encoding used by GPT-4) to about 200,000 (the o200k_base encoding used by GPT-4o).
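The text-to-IDs mapping described above can be sketched with a toy vocabulary. This is a minimal illustration, not any real tokenizer: the vocabulary, the `<unk>` fallback, and the greedy longest-match rule are all assumptions chosen for brevity (production tokenizers apply learned merge rules instead).

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = ["<unk>", " ", "token", "ization", "iz", "er", "s"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
MAX_TOKEN_LEN = max(len(tok) for tok in VOCAB)


def encode(text: str) -> list[int]:
    """Greedy longest-match encoding (a simplification, not real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(MAX_TOKEN_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos += length
                break
        else:
            # No vocabulary entry matches: emit <unk> and skip one character.
            ids.append(TOKEN_TO_ID["<unk>"])
            pos += 1
    return ids


def decode(ids: list[int]) -> str:
    """Map IDs back to their token strings and concatenate."""
    return "".join(VOCAB[i] for i in ids)


ids = encode("tokenization tokens")
print(ids)          # [2, 3, 1, 2, 6]
print(decode(ids))  # tokenization tokens
```

Note that the round trip is lossless only when every character is covered by the vocabulary. Byte-level tokenizers avoid the `<unk>` case by including all 256 byte values as base tokens.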
The dominant approach since roughly 2018 is Byte-Pair Encoding (BPE). During a training phase on a large text corpus, the most frequent adjacent pair of symbols (initially single characters or bytes) is repeatedly merged into a new subword unit, producing a vocabulary that balances coverage of rare words with efficient encoding of common ones. Alternatives include WordPiece (used in BERT-family models) and Unigram LM (implemented in the SentencePiece library and used by models such as T5). SentencePiece also implements BPE, which is the variant used by the original LLaMA and Mistral tokenizers. The tokenizer vocabulary is fixed before the language model itself is trained. At inference time, input text is encoded to a token ID sequence, and the model's output IDs are decoded back to readable text.
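The BPE training loop described above can be sketched in a few lines. This is a simplified character-level version under stated assumptions: the function name `train_bpe` and the word-frequency input format are illustrative, and details found in real implementations (byte-level base symbols, pre-tokenization rules, end-of-word markers, tie-breaking policy) are omitted.

```python
from collections import Counter


def train_bpe(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges BPE merge rules from a word -> frequency table."""
    # Each word starts as a tuple of single-character symbols.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus, replacing each occurrence of the best pair.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges


# Toy corpus: word -> count.
merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=4)
print(merges)
```

On this toy corpus the early merges build up the frequent fragments "est" and "low". The ordered merge list is what a tokenizer stores: encoding new text replays the merges in the same order, and the final vocabulary is the base symbols plus one new entry per merge.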
Tokenization directly determines system cost and model behavior: API pricing is denominated in tokens, context window limits are expressed in tokens, and models operate purely on integer IDs, never on raw characters. Languages that are underrepresented in the tokenizer's training corpus, including those with large character sets (Chinese, Japanese) or highly agglutinative morphology (Finnish, Turkish), are typically encoded less efficiently than English: equivalent semantic content can consume roughly two to four times as many tokens, and thus proportionally more compute and cost.
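The cost arithmetic can be made concrete with a back-of-the-envelope sketch. Every number here is an assumption for illustration: the 4-characters-per-token figure is only a rough English average, and the price, context window, and 3x multiplier are hypothetical placeholders, not any provider's actual values.

```python
import math

CHARS_PER_TOKEN_EN = 4.0  # rough English average; varies by tokenizer and text


def estimate_tokens(num_chars: int, chars_per_token: float = CHARS_PER_TOKEN_EN) -> int:
    """Approximate token count from character count (heuristic only)."""
    return math.ceil(num_chars / chars_per_token)


def estimate_cost_usd(num_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of processing num_tokens at a per-million-token price."""
    return num_tokens / 1_000_000 * usd_per_million_tokens


def fits_in_context(num_tokens: int, context_window: int) -> bool:
    """Whether a prompt of num_tokens fits a context limit given in tokens."""
    return num_tokens <= context_window


PRICE = 2.0            # hypothetical USD per million input tokens
CONTEXT_WINDOW = 8192  # hypothetical context limit, in tokens

english_tokens = estimate_tokens(8000)  # an 8,000-character English document
other_tokens = english_tokens * 3       # assumed 3x overhead for a less efficiently encoded language

print(english_tokens, estimate_cost_usd(english_tokens, PRICE))
print(other_tokens, estimate_cost_usd(other_tokens, PRICE))
print(fits_in_context(other_tokens, CONTEXT_WINDOW))
```

For real budgeting, count tokens with the target model's own tokenizer rather than a character heuristic, since the ratio differs across tokenizers, languages, and content types such as code.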
As of 2026, most frontier models (GPT-4o, Claude 3.5/4, Gemini 2.0) use subword tokenizers, generally BPE-style, with vocabularies of roughly 100,000 entries or more. Research into tokenizer-free and byte-level architectures (ByT5, MegaByte, Byte Latent Transformer) continues, aiming to eliminate the tokenization bottleneck entirely, but subword tokenization remains the dominant production approach across both open-weight and proprietary models.