Inference

Token

A token is the basic unit of text that a language model processes, typically a word, subword fragment, or punctuation mark. In common English prose, one word corresponds to approximately 1.3 tokens under widely used subword vocabularies.

A token is the atomic unit into which text is decomposed before being fed into a language model. Most modern models use subword tokenization, splitting text neither at the character level nor strictly at word boundaries, but into segments that balance vocabulary size with sequence length. The most common approaches are Byte-Pair Encoding (BPE, used by the GPT family), WordPiece (used by BERT and its derivatives), and the BPE and unigram models implemented by the SentencePiece library (used by LLaMA, T5, and many multilingual models). Vocabularies typically contain 32,000–200,000 token types; GPT-4's cl100k_base tokenizer contains approximately 100,000.
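To make the subword idea concrete, the following is a minimal sketch of BPE-style vocabulary training on a toy corpus. It assumes whitespace-separated words and character-level starting symbols; the function names (get_pair_counts, merge_pair, train_bpe) are illustrative and do not come from any particular library, and production tokenizers add byte-level handling, pre-tokenization rules, and special tokens that are omitted here.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs over a {symbol_tuple: frequency} corpus."""
    counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges; return the merges and the segmented corpus."""
    # Start from single characters: "low" -> ("l", "o", "w").
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Toy usage: the frequent word "low" collapses into one token after two merges,
# while the rarer "lower" and "lowest" remain split into several symbols.
merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

The sketch shows why frequent strings end up as single tokens: each merge step promotes the most common adjacent pair to a new vocabulary entry, so common words are absorbed early and rare ones stay fragmented.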

In practice, high-frequency English words such as "the" or "model" are usually single tokens, while rare words, technical terms, and most non-English text split into multiple tokens. Code and URLs also tokenize less efficiently than prose. The tokenizer is trained separately from the model on a large representative corpus, and its vocabulary is fixed before model training begins. At inference time, raw text is first converted into a sequence of integer token IDs, which are used to look up dense embedding vectors that the model's layers process.
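The text-to-IDs-to-embeddings path can be sketched as follows. This is an illustration only: the eight-entry vocabulary, the greedy longest-match encoder, and the randomly initialized 4-dimensional embedding table are all assumptions made for the example, whereas a real model uses a trained tokenizer and learned embeddings with thousands of dimensions.

```python
import random

# Hypothetical toy vocabulary; a real one holds tens of thousands of entries.
VOCAB = ["<unk>", "the", "model", "token", "iz", "ation", "s", " "]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}

EMBED_DIM = 4
_rng = random.Random(0)
# One dense vector per vocabulary entry (random here, learned in a real model).
EMBEDDINGS = [[_rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in VOCAB]


def encode(text):
    """Convert text to token IDs by greedy longest-match against VOCAB."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            piece = text[i:j]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit <unk> and skip one character.
            ids.append(TOKEN_TO_ID["<unk>"])
            i += 1
    return ids


def embed(ids):
    """Look up the embedding vector for each token ID."""
    return [EMBEDDINGS[token_id] for token_id in ids]


ids = encode("the tokenization")
print(ids)              # [1, 7, 3, 4, 5]
print(len(embed(ids)))  # 5 vectors, one per token
```

Note how the common word "the" maps to a single ID, while "tokenization" is split into three pieces, mirroring the behavior described above.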

Tokens matter for three practical reasons. First, context windows (the maximum sequence a model can process at once) are measured in tokens; as of 2025, leading models ranged from 128,000 tokens (GPT-4o) to 200,000 (Claude 3.5 Sonnet) to over one million (Gemini 1.5 Pro), with continued expansion underway. Second, cloud inference APIs charge per token, both for input tokens consumed and for output tokens generated, making token efficiency a direct cost driver. Third, standard transformer attention scales quadratically with sequence length in tokens, so doubling the context roughly quadruples the attention computation; this makes longer contexts computationally expensive and motivates research into linear-attention and sparse-attention variants.
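The quadratic scaling mentioned above can be illustrated with a few lines of arithmetic. This sketch counts only the entries of a single full attention score matrix (one per query-key pair) and ignores heads, layers, and implementation optimizations; the function names are hypothetical.

```python
def attention_pairs(num_tokens: int) -> int:
    """Entries in one full self-attention score matrix (every token attends to every token)."""
    return num_tokens * num_tokens


def relative_attention_cost(long_context: int, short_context: int) -> float:
    """How many times more score entries the longer context needs."""
    return attention_pairs(long_context) / attention_pairs(short_context)


# Doubling the sequence length quadruples the number of score entries.
print(relative_attention_cost(2_000, 1_000))        # 4.0

# Going from a 128,000-token window to a 1,000,000-token window is about
# 7.8x more tokens but roughly 61x more attention score entries.
print(relative_attention_cost(1_000_000, 128_000))  # 61.03515625
```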

As of 2026, the token abstraction has extended beyond text to multimodal models. Images are typically encoded as a fixed number of visual tokens (commonly 256–1,024 per image) concatenated with text tokens before processing by a unified transformer. Audio and video inputs follow analogous quantization-then-tokenization pipelines, making the token the universal currency across multimodal AI architectures.
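One common way an image becomes a fixed number of visual tokens is to cut it into non-overlapping square patches, with one token per patch. The sketch below assumes that ViT-style scheme with a hypothetical patch size of 14 pixels; the source does not specify an encoder, and real multimodal models vary in resolution, patch size, and any later token compression.

```python
def visual_token_count(height: int, width: int, patch_size: int) -> int:
    """Tokens produced by splitting an image into non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


def total_sequence_length(text_tokens: int, image_sizes, patch_size: int = 14) -> int:
    """Length of the combined sequence once visual tokens are concatenated with text tokens."""
    visual = sum(visual_token_count(h, w, patch_size) for h, w in image_sizes)
    return text_tokens + visual


print(visual_token_count(224, 224, 14))           # 256
print(visual_token_count(448, 448, 14))           # 1024
print(total_sequence_length(100, [(224, 224)]))   # 356
```

Under these assumptions, a 224x224 image yields 256 tokens and a 448x448 image yields 1,024, which matches the range quoted above and shows why images consume a meaningful share of the context window.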

Example

A 10-page legal contract submitted to an AI assistant might contain roughly 5,000 tokens; at an API rate of a few dollars per million input tokens (for example, $3 per million), processing that document costs about 1.5 cents in inference fees.
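The arithmetic behind this example can be written out as a short sketch. The price of $3 per million input tokens is an assumed placeholder rather than any provider's actual rate, and the 1.3 tokens-per-word ratio is the rough English-prose figure quoted earlier in this entry; both function names are illustrative.

```python
def estimate_tokens(word_count: int, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate for English prose from a word count."""
    return round(word_count * tokens_per_word)


def input_cost_usd(num_tokens: int, usd_per_million_tokens: float) -> float:
    """Inference fee for the input tokens at a per-million-token price."""
    return num_tokens * usd_per_million_tokens / 1_000_000


# Assumed price: $3 per million input tokens (placeholder, not a real quote).
cost = input_cost_usd(5_000, usd_per_million_tokens=3.0)
print(f"${cost:.3f}")          # $0.015, i.e. about 1.5 cents

# A 1,000-word document is roughly 1,300 tokens under the 1.3 heuristic.
print(estimate_tokens(1_000))  # 1300
```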

Related terms

Tokenization, Context Window, Inference

