Habr AI→ original

Habr AI Shows How to Add Memory and Context to an LLM Chat in Python with Ollama and LiteLLM

In the third part of the series on LLM chat in Python, the author added what transforms dialogue from isolated queries into continuous conversation—message…

AI-processed from Habr AI; edited by Hamidun News
Habr AI Shows How to Add Memory and Context to an LLM Chat in Python with Ollama and LiteLLM
Source: Habr AI. Collage: Hamidun News.
◐ Listen to article

A simple console chat with an LLM stops being a true chat if the model only sees the current question. In a new part of the practical Python guide using Ollama and LiteLLM, the author shows the most important step after basic integration: how to add message history and turn a chain of separate requests into a coherent dialogue with context. The problem with the previous version was that at each turn, the program sent the model only the system instruction and a fresh user reply.

For a human, this looks almost like an imperceptible limitation until dependent questions appear like "make it shorter" or "add practice to it". Without previous messages, the model doesn't know what the pronouns and clarifications refer to, and therefore may respond randomly or lose the thread of conversation. This is exactly what distinguishes a one-time model call from a dialogue interface.

The analysis emphasizes an important idea that is often oversimplified in LLM discussions: the model has no hidden long-term memory between calls. Memory in such a chat is not created by Ollama or LiteLLM itself, but by the application that stores the conversation and sends it to the model anew with each new request. In the educational example, an ordinary Python list conversation_history is used for this, where messages with user and assistant roles are recorded in turn.

The system prompt is not stored in the history but is added separately each time the request is assembled. Architecturally, the change looks small, but radically changes the program's behavior. The request sending function now takes not just the current user_message, but also the history.

Then a list of messages is formed in strict order: system, then all past replies, then the new user question. After the model's response, the application saves both sides of the exchange — both the question and the answer. This is not a decorative detail: if you only record user replies, the next request will be incomplete because the model will see that the question was already asked but won't see how it answered itself.

Separately, history limitation is also discussed. In the example, a constant MAX_HISTORY_MESSAGES = 6 is introduced, and a helper function trim_history keeps only the last six messages, i.e.

, the last three message exchanges. For a local prototype, this is a practical compromise: the history doesn't grow infinitely, requests don't bloat, and the model still gets the nearest context. This is a good way to show that memory in LLM applications always has to balance between answer quality, latency, and cost or load, even when it comes to a local model.

The article uses a local configuration based on Ollama, LiteLLM, and the qwen2.5:3b model. This stack is convenient for learning because it allows you to build a working chat without an external API and without complex infrastructure.

In the examples, you can clearly see how the system's reaction changes after adding history: the request "tell me more about the second one" now correctly refers to the previous list, and the phrase "make it shorter" is no longer perceived as a separate abstract command. That is, the model doesn't become smarter by itself, but the application gives it the context that previously was simply lost. At the same time, the limitations of this version are also named directly.

All history is stored only in the RAM of the current run: if you close the script, the dialogue disappears. For an educational CLI, this is enough, but for a Telegram bot, web service, or client assistant, permanent storage would be needed — at least in a file, SQLite, or a full database. The next steps usually include also summarizing long dialogues, filtering irrelevant replies, and more flexible context window management.

The main conclusion from this part is very grounded and therefore useful: memory in an LLM chat is not a separate magical function of the model, but the ordinary responsibility of the code around it. Just a few lines with conversation_history, the correct order of messages, and a simple history size limit already turn the demo script into the foundation for a more realistic assistant. For developers, this is a good example of how the value of a chat interface is often born not in the choice of the model itself, but in how carefully the application collects and passes the conversation context.

ZK
Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.

Want to stop reading about AI and start using it?

AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.

What do you think?
Loading comments…