Token
A token is the basic unit of text that a language model processes, typically a word, subword fragment, or punctuation mark. In common English prose, one word corresponds to approximately 1.3 tokens under widely used subword vocabularies.
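As a rough illustration of the 1.3 tokens-per-word figure, the sketch below estimates a token count from a whitespace word count. The function name and the use of whitespace splitting are assumptions made for illustration; an actual tokenizer would give a different, exact count.

```python
import math

# Average quoted in the text for common English prose; an approximation only.
TOKENS_PER_WORD = 1.3


def estimate_tokens(text: str, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    word_count = len(text.split())
    return math.ceil(word_count * tokens_per_word)


sample = "A token is the basic unit of text that a language model processes"
print(len(sample.split()), "words ->", estimate_tokens(sample), "tokens (estimated)")
```

Such an estimate is only suitable for budgeting; code, URLs, and non-English text usually need more tokens per word.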
Before text is fed into a language model, it is decomposed into tokens. Most modern models use subword tokenization, splitting text neither at the character level nor strictly at word boundaries, but into segments that balance vocabulary size against sequence length. The most common algorithms are Byte-Pair Encoding (BPE, used by the GPT family), WordPiece (used by BERT and its derivatives), and the unigram language model. SentencePiece, a toolkit that implements both BPE and unigram tokenization, is used by T5, the early LLaMA releases, and many multilingual models. Vocabularies typically contain 32,000–200,000 token types; GPT-4's cl100k_base tokenizer contains approximately 100,000.
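The core idea of BPE training, repeatedly merging the most frequent adjacent pair of symbols, can be sketched in a few lines. This is a minimal character-level toy under stated assumptions: the corpus, the function names, and the tie-breaking rule are all illustrative, and production tokenizers (which typically operate on bytes and apply pre-tokenization rules) differ in many details.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping."""
    words = {tuple(word): count for word, count in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # (an arbitrary choice made here only to keep the toy deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words


# Hypothetical toy corpus: word -> frequency.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(toy_corpus, num_merges=3)
print(merges)
print(segmented)
```

Each learned merge adds one entry to the vocabulary, so the number of merges is what sets the trade-off between vocabulary size and sequence length.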
In practice, high-frequency English words such as "the" or "model" are usually single tokens, while rare words, technical terms, and much non-English text split into multiple tokens. Code and URLs also tokenize less efficiently than prose. The tokenizer is trained separately from the model on a large representative corpus, and its vocabulary is fixed before model training begins. During both training and inference, raw text is first converted into a sequence of integer token IDs, which are used to look up dense embedding vectors that the model's layers process.
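The path from raw text to token IDs to embedding vectors can be illustrated with a toy example. Everything here is hypothetical: the seven-entry vocabulary, the greedy longest-match segmentation (standing in for a real trained tokenizer), and the randomly initialised embedding table (learned during training in a real model).

```python
import random

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"<unk>": 0, "the": 1, "model": 2, "token": 3, "iz": 4, "er": 5, " ": 6}
EMBED_DIM = 4

# One dense vector per token ID. Random values here; learned weights in practice.
_rng = random.Random(0)
EMBEDDINGS = [[_rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in VOCAB]


def encode(text):
    """Greedy longest-match segmentation; unmatched characters map to <unk>."""
    ids, i = [], 0
    max_len = max(len(piece) for piece in VOCAB)
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += length
                break
        else:
            # No vocabulary entry starts here: emit <unk> and skip one character.
            ids.append(VOCAB["<unk>"])
            i += 1
    return ids


def embed(ids):
    """Look up the embedding vector for each token ID."""
    return [EMBEDDINGS[token_id] for token_id in ids]


ids = encode("the tokenizer")
vectors = embed(ids)
print(ids)
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Note how the common word "the" maps to a single ID, while "tokenizer", absent from this toy vocabulary, is split into three pieces.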
Tokens matter for three practical reasons. First, context windows (the maximum sequence a model can process at once) are measured in tokens; as of 2025, leading models ranged from 128,000 tokens (GPT-4o) to 200,000 (Claude 3.5 Sonnet) to over one million (Gemini 1.5 Pro), with continued expansion underway. Second, cloud inference APIs typically bill per input token and per output token, making token efficiency a direct cost driver. Third, the compute and memory cost of standard transformer attention scales quadratically with sequence length in tokens, making longer contexts computationally expensive and motivating research into linear-attention and sparse-attention variants.
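The three considerations above reduce to simple arithmetic, sketched below. The function names and the per-million-token prices are hypothetical placeholders rather than any provider's actual rates, and the context check assumes that prompt and generated tokens share one window, which may not hold for every model.

```python
def fits_in_context(prompt_tokens, max_output_tokens, context_window):
    """Assumes prompt and generated tokens both count against one window."""
    return prompt_tokens + max_output_tokens <= context_window


def request_cost(input_tokens, output_tokens, usd_per_million_in, usd_per_million_out):
    """Cost in USD under separate (hypothetical) input and output rates."""
    total = input_tokens * usd_per_million_in + output_tokens * usd_per_million_out
    return total / 1_000_000


def attention_score_entries(seq_len):
    """Entries in one full self-attention score matrix (per head, per layer)."""
    return seq_len * seq_len


# 1. Context window: does a 120,000-token prompt leave room for 4,000 output tokens?
print(fits_in_context(120_000, 4_000, context_window=128_000))

# 2. Cost: placeholder prices of 3 and 15 USD per million input/output tokens.
print(request_cost(10_000, 1_000, usd_per_million_in=3.0, usd_per_million_out=15.0))

# 3. Quadratic scaling: doubling the sequence length quadruples the score matrix.
print(attention_score_entries(2_000) / attention_score_entries(1_000))
```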
As of 2026, the token abstraction has extended beyond text to multimodal models. Images are often encoded as a fixed number of visual tokens (commonly 256–1,024 per image), which are concatenated with text tokens before processing by a unified transformer. Audio and video inputs follow analogous quantization-then-tokenization pipelines, making the token a common currency across multimodal AI architectures.
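One way a fixed visual token count can arise is by cutting the image into a grid of square patches, with one token per patch. The sketch below assumes this patch-based scheme; the text does not specify an encoding, the function name and the resolutions are illustrative, and real systems may additionally pool, resample, or tile the image.

```python
def visual_token_count(height, width, patch_size):
    """Number of non-overlapping square patches, assuming one token per patch."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


# Illustrative resolutions with 14-pixel patches.
for side in (224, 448):
    print(f"{side}x{side} image -> {visual_token_count(side, side, 14)} visual tokens")
```

Under these assumptions, doubling the image side length quadruples the number of visual tokens, which then count against the context window like any other tokens.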