Inference

Context Window

A context window is the maximum number of tokens a language model can process in a single inference call, covering both the input prompt and the generated output. Exceeding it causes input truncation or an API error; larger windows enable full-document analysis without external retrieval systems.

The context window defines the upper bound on the sequence length a transformer-based language model can attend to in one forward pass. Its size is determined at training time by the positional encoding scheme and the sequence lengths the model was trained on. Everything the model can see simultaneously — the system prompt, conversation history, retrieved documents, tool call results, and the in-progress generated response — must fit within this limit, measured in tokens.
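The "everything must fit" rule can be sketched as a simple budget check. This is a minimal illustration, not any provider's API: `rough_token_count` is a hypothetical stand-in that assumes roughly four characters per token, and `ContextBudget` with its fields is invented for this example. Real budgeting should use the model's own tokenizer.

```python
from dataclasses import dataclass


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer (assumption: ~4 characters per token)."""
    if not text:
        return 0
    return max(1, len(text) // 4)


@dataclass
class ContextBudget:
    window: int           # total context window, in tokens
    reserved_output: int  # tokens set aside for the generated response

    def remaining(self, parts: list[str]) -> int:
        """Tokens left after counting every input part and the output reservation."""
        used = sum(rough_token_count(p) for p in parts)
        return self.window - self.reserved_output - used

    def fits(self, parts: list[str]) -> bool:
        """True when inputs plus the reserved output stay within the window."""
        return self.remaining(parts) >= 0


# Hypothetical request: system prompt, one history turn, one retrieved document.
parts = [
    "You are a helpful assistant.",
    "User: summarize the report.",
    "report text " * 2000,
]

budget = ContextBudget(window=8192, reserved_output=1024)
print("fits:", budget.fits(parts), "remaining tokens:", budget.remaining(parts))
```

Reserving output tokens up front reflects the point above: the response in progress shares the window with the prompt, so an input that fills the window exactly leaves no room to generate.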

Transformers compute attention between every pair of tokens in the context, so in the naive implementation compute and memory scale quadratically with context length. Several techniques have made very long contexts practical. Sparse attention and sliding-window attention (used in Mistral 7B) restrict which tokens attend to each other; FlashAttention (versions 2 and 3) computes exact attention without materializing the full score matrix, which cuts memory use; ring attention distributes a long sequence across devices. Positional encodings (RoPE, ALiBi, and others) communicate each token's position in the sequence. Models trained with RoPE can often be extended beyond their training length with techniques such as YaRN, which community models used, together with a short fine-tune, to extend LLaMA 2's native 4k context to 128k. KV-cache memory grows linearly with context length, making very long contexts GPU-memory intensive at inference time.
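The two scaling behaviours can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a 7B-class configuration (32 layers, 32 attention and KV heads, head dimension 128, 16-bit values) and batch size 1. These figures and both function names are illustrative assumptions, and the attention estimate covers only the naive case, where one layer materializes the full score matrix.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one sequence: keys and values for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * seq_len  # linear in sequence length


def naive_attention_scores_bytes(seq_len: int, n_heads: int,
                                 bytes_per_elem: int = 2) -> int:
    """Score matrix for ONE layer if materialized in full: seq_len x seq_len per head."""
    return n_heads * seq_len * seq_len * bytes_per_elem  # quadratic in sequence length


GIB = 1024 ** 3

# Assumed 7B-class configuration (illustrative values only).
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128

for seq_len in (4_096, 32_768, 131_072):
    kv = kv_cache_bytes(seq_len, N_LAYERS, N_HEADS, HEAD_DIM) / GIB
    scores = naive_attention_scores_bytes(seq_len, N_HEADS) / GIB
    print(f"{seq_len:>7} tokens: KV cache {kv:8.1f} GiB, "
          f"naive scores per layer {scores:8.1f} GiB")
```

Doubling the sequence length doubles the KV-cache figure but quadruples the naive score-matrix figure, which is why kernels that avoid materializing the score matrix matter more as contexts grow.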

Context window size directly determines which tasks a model can perform without retrieval augmentation. A 4,096-token window cannot hold a full research paper; a 1,000,000-token window can ingest an entire software repository or a multi-hundred-page document, enabling in-context question answering without an external vector database. Longer contexts also allow retaining full conversation histories, removing the need for lossy summarization between turns.

Context windows expanded rapidly between 2023 and 2026. GPT-4 launched in 2023 with an 8k-token window (32k in a separate variant). Gemini 1.5 Pro established 1M tokens as a production capability in 2024, and Gemini 2.0 Flash also supports 1M tokens; by 2026, Claude 3.5 and Claude 4 models support up to 200k tokens. A persistent practical limitation is the "lost in the middle" effect: models tend to attend more strongly to the beginning and end of long contexts, so information in the middle of a very long sequence is underweighted even though it is technically within the window.

Example

A legal team uses a 200,000-token model to ingest an entire 600-page merger agreement in a single API call, asking targeted questions about indemnification clauses without first chunking the document or building a retrieval index.
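Whether a document of this length fits in one call depends on how densely its pages tokenize, so a rough pre-flight estimate is worth running first. The sketch below is illustrative only: the tokens-per-page densities and the output reservation are assumed values, not measurements, and both function names are hypothetical.

```python
def estimated_tokens(pages: int, tokens_per_page: int) -> int:
    """Rough document size in tokens from a page count and an assumed density."""
    return pages * tokens_per_page


def fits_in_window(pages: int, tokens_per_page: int, window: int,
                   reserved_output: int = 0) -> bool:
    """True when the estimated document plus reserved output fits the window."""
    return estimated_tokens(pages, tokens_per_page) + reserved_output <= window


WINDOW = 200_000
PAGES = 600

# Assumed densities: sparse, moderate, and dense page layouts.
for density in (250, 350, 500):
    ok = fits_in_window(PAGES, density, WINDOW, reserved_output=4_000)
    print(f"{density} tokens/page -> {estimated_tokens(PAGES, density):,} tokens, "
          f"fits: {ok}")
```

An exact count from the model's tokenizer should replace the estimate before the call is made; the estimate only shows whether a single-call approach is plausible.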

Related terms

Token, KV-Cache, Prompt Caching, RAG (retrieval-augmented generation)
