LLM Context Window: Why Neural Networks Forget Parts of Your Conversation
LLMs do not retain memory between requests — with each new message, the model rereads the entire conversation from scratch. This 'visibility box' is called a…
AI-processed from Habr AI; edited by Hamidun News
LLM works not like a human with memory — but like an expert who re-reads the entire correspondence from scratch each time and only then formulates a response. This is precisely the key architectural feature of modern neural networks that often confuses new users.
Why the Model "Forgets"
When you send a new message to a chat with AI, the model doesn't "remember" the previous response in the conventional sense. It has no operational memory like a computer, and no long-term memory like a human. Each time you write something new, the model receives the entire dialogue as input — from the very first message to the last — and reprocesses it anew to formulate a response. This limited "box" that holds the entire conversation is called the context window. Its size is measured in tokens — units of text that roughly correspond to 0.75 words each. The longer the conversation, the more tokens it takes up — and the closer it gets to the limit.
What Happens at the Limit
The context window is not infinite, and each model has its own ceiling. Here's what the limits look like for popular solutions:
- GPT-4o — 128,000 tokens (approximately 96,000 words)
- Claude 3.5 Sonnet — 200,000 tokens (approximately 150,000 words)
- Gemini 1.5 Pro — up to 1,000,000 tokens
- Older models (GPT-3) — only 4,000 tokens
When the dialogue reaches the limit, older parts literally "drop out": the model stops seeing them. If at the beginning of a long session you wrote "my name is Andrei" or provided key task context, and then continued the conversation for several more hours — by the end, the AI will likely "not remember" these details. This is not a glitch or inattention. It's mathematics: the information simply went beyond the window.
How Developers Combat This
To hide this limitation from users or at least soften it, developers add several layers of logic on top of the base LLMs. For the average user, they are invisible — but they are what make working with AI more comfortable.
Summarization. The system automatically compresses old parts of the dialogue, preserves key facts in a compact form, and frees up tokens for new messages. Users typically don't notice this.
Vector memory. Important facts from the conversation are stored in a separate database and retrieved as needed. This is how RAG (Retrieval-Augmented Generation) systems work: they pull in the necessary context at the right moment, without constantly filling the window with it.
System prompt. Part of the context window is reserved in advance — for permanent instructions, user profile, and task facts. This part is not displaced by dialogue history.
Caching. Some providers cache part of the context on the server side so that the same data doesn't need to be transmitted with each request. This reduces computational costs and slightly speeds up the response.
"The context window is not a bug, it's a key architectural decision of
transformers," explain ML engineers, adding: the quadratic complexity of attention operations means that doubling the window quadruples the computational costs.
What This Means
Understanding the context window explains many "oddities" in AI behavior: why the model forgets details toward the end of a long dialogue, why it only sees a fragment of a large document, why agents need a separate memory system. This is a fundamental architectural limitation — and the industry is actively learning to work with it: expanding windows, adding external memory, exploring new architectures like Mamba. For now, the context window remains one of the main tradeoffs in the world of LLMs.
Want to stop reading about AI and start using it?
AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.