
NGT Memory Module Shows How to Give LLMs Persistent Memory Without a Vector Database

NGT Memory is an open-source memory module for LLM that stores facts about the user between sessions and injects them into model responses. Inside: cosine…

AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

The open-source NGT Memory project offers a practical answer to an old LLM problem: how to keep important facts about a user from one session to the next. Instead of endlessly expanding the context, the module stores facts, builds a user profile, and injects the relevant memory into the model's response.

How Memory Works

The project's author starts from a simple idea: keeping the entire conversation history in the context window is convenient only until the token limit is reached. After that, the options are truncating old messages, summarizing them and losing detail, or connecting an external vector database with the extra infrastructure overhead that brings. NGT Memory proposes another path: memory lives directly in a Python process, is exposed through a REST API, ships with Docker, and does not require a separate database at launch. The idea is to retrieve not the entire conversation, but only the facts the model actually needs for the current request.

  • Cosine similarity finds semantically close facts
  • Associative graph links concepts like "vegetarian" and "restaurants"
  • Hierarchical consolidation promotes frequently used facts into long-term memory
  • Structured profile puts age, city, diet, and allergies into separate slots
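The first three mechanisms in the list can be illustrated with a toy in-process store. This is a minimal sketch under stated assumptions, not NGT Memory's actual code: the class `FactMemory`, its methods, the promotion threshold, and the hand-written three-dimensional vectors are all hypothetical, and a real system would get its vectors from an embedding model.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class FactMemory:
    """Hypothetical in-process store; names and thresholds are assumptions."""

    def __init__(self, promote_after=2):
        self.facts = {}                     # fact text -> embedding vector
        self.links = defaultdict(set)       # concept -> associated concepts
        self.hits = defaultdict(int)        # fact text -> times retrieved
        self.long_term = set()              # facts promoted by consolidation
        self.promote_after = promote_after  # assumed promotion threshold

    def add(self, text, vector):
        self.facts[text] = vector

    def link(self, a, b):
        # Associative graph: an undirected edge between two concepts.
        self.links[a].add(b)
        self.links[b].add(a)

    def related(self, concept):
        return sorted(self.links.get(concept, ()))

    def search(self, query_vector, top_k=2):
        # Rank stored facts by cosine similarity to the query vector.
        ranked = sorted(
            self.facts,
            key=lambda text: cosine_similarity(query_vector, self.facts[text]),
            reverse=True,
        )[:top_k]
        # Consolidation: facts retrieved often enough move to long-term memory.
        for text in ranked:
            self.hits[text] += 1
            if self.hits[text] >= self.promote_after:
                self.long_term.add(text)
        return ranked


memory = FactMemory()
memory.add("User is vegetarian", [1.0, 0.1, 0.0])       # toy vectors, not real embeddings
memory.add("User lives in Berlin", [0.0, 1.0, 0.0])
memory.add("User is allergic to nuts", [0.8, 0.0, 0.6])
memory.link("vegetarian", "restaurants")

query = [1.0, 0.0, 0.1]  # stands in for an embedded food-related question
first = memory.search(query)
second = memory.search(query)
print(first, memory.related("vegetarian"), sorted(memory.long_term))
```

With these toy vectors the two food-related facts outrank the city, and retrieving them twice is enough to promote them under the assumed threshold.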

On top of textual memory, the module also builds a user profile. If a person writes in fragments, the system can glue pieces like "I'm" + "30" + "years old" into one meaningful fact and save it with reduced confidence. If a conflict arises later, for example the age suddenly decreases, the update is blocked until the user explicitly corrects it with something like "I made a mistake". As a result, memory behaves not as a passive log but as a validation layer that tries to tell real user data from random noise.
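The fragment merging and conflict blocking described above might look roughly like the following. It is only a sketch: the regular expression, the confidence values 0.9 and 0.6, the correction phrase list, and the names `extract_age` and `UserProfile` are assumptions made for illustration, not details taken from the project.

```python
import re

# Assumed pattern for an age statement once fragments are joined.
AGE_PATTERN = re.compile(r"\bi'?m\s+(\d{1,3})\s+years?\s+old\b", re.IGNORECASE)

# Assumed list of phrases that count as an explicit correction.
CORRECTION_MARKERS = ("i made a mistake",)


def extract_age(fragments):
    """Join consecutive message fragments and look for an age statement.

    Returns (age, confidence) or None. A fact assembled from several
    fragments gets a lower (assumed) confidence than a single message.
    """
    joined = " ".join(fragment.strip() for fragment in fragments)
    match = AGE_PATTERN.search(joined)
    if match is None:
        return None
    confidence = 0.9 if len(fragments) == 1 else 0.6
    return int(match.group(1)), confidence


class UserProfile:
    """Hypothetical structured profile with one slot per attribute."""

    def __init__(self):
        self.slots = {}  # slot name -> (value, confidence)

    def update_age(self, new_age, confidence, message=""):
        """Store the age unless it decreases without an explicit correction."""
        current = self.slots.get("age")
        corrected = any(marker in message.lower() for marker in CORRECTION_MARKERS)
        if current is not None and new_age < current[0] and not corrected:
            return False  # conflict: keep the old value
        self.slots["age"] = (new_age, confidence)
        return True


profile = UserProfile()
age, confidence = extract_age(["I'm", "30", "years old"])
accepted_first = profile.update_age(age, confidence)
blocked = profile.update_age(25, 0.9, "I'm 25 years old")
accepted_fix = profile.update_age(25, 0.9, "Sorry, I made a mistake, I'm 25")
```

In this toy version the second update is rejected because the age drops without a correction phrase, and the third goes through because the message contains one.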

Test Results

The most interesting part of the article is not the architecture itself but its verification on practical scenarios. In one experiment covering a medical assistant, a personal assistant, and tech support, memory mode raised factual accuracy from 1.22 to 2.44 points out of 3. In a more realistic A/B test, memory won 17 of 18 ratings, and the average answer rating was 0.889 with memory versus 0.056 without. Separately, the author ran 54 edge-case checks, of which 51 passed, or about 94%.
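For readers who want to check the quoted figures, the derived ratios can be recomputed directly. The snippet uses only numbers stated in the article; the variable names are illustrative, and how the 0.889 and 0.056 averages were scored is not specified, so they are left out.

```python
# Figures quoted in the article.
edge_cases_passed, edge_cases_total = 51, 54
ab_wins, ab_ratings = 17, 18
accuracy_without, accuracy_with = 1.22, 2.44

edge_case_pass_rate = edge_cases_passed / edge_cases_total
ab_win_rate = ab_wins / ab_ratings
accuracy_ratio = accuracy_with / accuracy_without

print(f"edge-case pass rate: {edge_case_pass_rate:.1%}")
print(f"A/B win rate:        {ab_win_rate:.1%}")
print(f"accuracy ratio:      {accuracy_ratio:.2f}x")
```

Both the edge-case pass rate and the A/B win rate come out at roughly 94.4%, and the factual accuracy score doubles.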

"An answer to garbage is also garbage".

This maxim neatly explains another important layer of the system: a quality filter. Short meaningless messages, bare numbers, special characters, and even the assistant's replies to such garbage are kept out of memory. Otherwise, useful facts would quickly drown in noise. Such a filter matters especially for chatbots and agentic systems, where the user often writes in fragments and the model itself tends to generate polite but empty phrases that only clutter the results of the next memory search.
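A filter of this kind can be approximated with a few heuristics. The sketch below is an assumption about how such rules could be written, not the project's implementation: the two-word minimum, the "no letters" check, and the names `is_garbage` and `filter_turns` are all hypothetical.

```python
import re

MIN_WORDS = 2  # assumed threshold for a message worth remembering


def is_garbage(text):
    """Heuristic check for messages that should not reach memory."""
    stripped = text.strip()
    if not stripped:
        return True
    # No letters at all: covers bare numbers and runs of special characters.
    if not re.search(r"[^\W\d_]", stripped):
        return True
    # Very short messages such as "ok" carry no storable fact.
    return len(stripped.split()) < MIN_WORDS


def filter_turns(turns):
    """Keep meaningful (role, text) turns.

    A garbage user message is dropped together with the assistant reply
    that follows it, since "an answer to garbage is also garbage".
    """
    kept = []
    skip_reply = False
    for role, text in turns:
        if role == "user":
            skip_reply = is_garbage(text)
            if not skip_reply:
                kept.append((role, text))
        elif role == "assistant":
            if not skip_reply and not is_garbage(text):
                kept.append((role, text))
            skip_reply = False
    return kept


dialogue = [
    ("user", "ok"),
    ("assistant", "Sure, let me know if you need anything!"),
    ("user", "I am allergic to peanuts"),
    ("assistant", "Noted, I will avoid peanut dishes."),
    ("user", "12345"),
    ("user", "?!?!"),
]
clean = filter_turns(dialogue)
```

Only the allergy statement and its reply survive; the polite answer to "ok" is discarded along with the message that triggered it.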

Main Limitations

Despite the good results, NGT Memory cannot yet be called a universal, production-ready memory without caveats. The current version stores everything in the RAM of a single process, so the state is lost when the container restarts. Running multiple workers raises another problem: each worker holds its own session storage, so a save request may go to one process while the retrieval request goes to another.
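Both limitations are easy to reproduce in a toy model. The simulation below is purely illustrative: the `Worker` class and the round-robin routing are assumptions standing in for separate server processes behind a load balancer, not a description of how NGT Memory is deployed.

```python
from itertools import cycle


class Worker:
    """Stands in for one server process with its own in-RAM session store."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # session_id -> list of facts, private to this worker

    def save(self, session_id, fact):
        self.sessions.setdefault(session_id, []).append(fact)

    def retrieve(self, session_id):
        return list(self.sessions.get(session_id, []))


workers = [Worker("w1"), Worker("w2")]
round_robin = cycle(workers)  # assumed routing: requests alternate between workers

# The save request lands on w1 ...
next(round_robin).save("alice", "allergic to peanuts")
# ... but the next request for the same session lands on w2, which knows nothing.
found_on_other_worker = next(round_robin).retrieve("alice")

# A restart is modeled as a fresh object: the in-RAM state is gone.
restarted = Worker("w1")
found_after_restart = restarted.retrieve("alice")

print(found_on_other_worker, found_after_restart)
```

The fact is still present in the first worker's memory, yet neither the second worker nor the restarted one can see it.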

In tests, local memory operations are fast, about 2–3 ms on CPU, but most of the latency still comes from external embedding and chat calls, which take hundreds of milliseconds. The author also describes more mundane bugs. An overly soft system prompt led the model to see the memory but ignore it as optional advice.

After the instruction was tightened, answers became more stable. Another unpleasant case came from a regex for extracting a name: a pattern like "I'm" + name started accepting words like "allergic" as a name. Details like these show that LLM memory depends not only on retrieval but also on a host of small rules, without which a demo quickly falls apart in real dialogue.
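The name-extraction bug is simple to reproduce. The pattern below is a guess at the kind of regex involved, and the guard (a small stop-list plus a capitalization check) is just one possible mitigation; the article does not say how the author actually fixed it, and the stop-list contents are invented for the example.

```python
import re

# Assumed naive pattern: whatever word follows "I'm" is taken as the name.
NAIVE_NAME = re.compile(r"\bI'm\s+(\w+)", re.IGNORECASE)

# Hypothetical stop-list of words that often follow "I'm" but are not names.
NOT_NAMES = {"allergic", "vegetarian", "tired", "from", "a", "an", "not"}


def extract_name_naive(text):
    match = NAIVE_NAME.search(text)
    return match.group(1) if match else None


def extract_name_guarded(text):
    """Same pattern, but reject stop-listed or non-capitalized candidates."""
    match = NAIVE_NAME.search(text)
    if match is None:
        return None
    candidate = match.group(1)
    if candidate.lower() in NOT_NAMES or not candidate[0].isupper():
        return None
    return candidate


print(extract_name_naive("I'm allergic to nuts"))    # reproduces the bug
print(extract_name_guarded("I'm allergic to nuts"))
print(extract_name_guarded("I'm Anna"))
```

The guard still misses cases such as users who type their name in lowercase, which is exactly the sort of small rule the article says accumulates in real dialogue.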

What This Means

NGT Memory shows a useful shift in the approach to LLM applications: memory needs to be not just "smart" but also verifiable, cheap, and resistant to garbage. For developers of bots and AI agents, this is a good sign that a long-term context layer can already be assembled as a separate engineering module rather than as a set of disconnected workarounds.

