Qdrant and DRAG with KNEE: How to Make RAG Adaptive and Avoid Wasting Extra Tokens

Q: What is the source?

Originally published on Habr AI. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Apr 30, 2026. Reading time: 3 min.

We demonstrate how to solve the main problem of classical RAG: either empty context or excessive text for the LLM. The DRAG with KNEE approach based on…

Hamidun News Editorial


A practical walkthrough has been published on how to make a RAG system cheaper to run and more accurate. At its core is DRAG with KNEE, an approach built on Qdrant and Python that chooses the amount of context dynamically rather than through a fixed top_k.

Why static top_k breaks

Almost anyone who has built RAG on top of long PDFs has encountered the same problem: if you take too few chunks, the model misses important context and starts to make things up. If you take too many, noise enters the prompt, and along with it come growing costs, latency, and the risk that the LLM will latch onto a random fragment. A single top_k parameter in such a setup tries to solve too many different problems and almost always does it poorly.

The author calls this compromise a fundamental weakness of classical RAG. A fixed number of documents does not account for the query type, the structure of the source file, or the density of useful information within the corpus. For a short fact, a couple of excerpts may be enough, but for a complex question about a multi-page document they are not.

As a result, the system either underfeeds the model with context or, conversely, overloads it with irrelevant text and burns through the token budget.
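To make the trade-off concrete, here is a minimal Python sketch. The similarity scores, the 0.75 relevance threshold, and the helper name fixed_top_k_report are all invented for illustration and do not come from the original article.

```python
# Hypothetical similarity scores (higher = more relevant), sorted best-first.
# The numbers are made up; they only illustrate two different query shapes.
factual_query = [0.91, 0.88, 0.41, 0.39, 0.38, 0.36, 0.35, 0.33]
broad_query = [0.86, 0.85, 0.83, 0.82, 0.80, 0.79, 0.77, 0.40]

RELEVANCE_THRESHOLD = 0.75  # assumed boundary between "useful" and "noise"


def fixed_top_k_report(scores, k):
    """Count useful chunks kept, noise chunks kept, and useful chunks missed."""
    kept, dropped = scores[:k], scores[k:]
    useful_kept = sum(1 for s in kept if s >= RELEVANCE_THRESHOLD)
    useful_missed = sum(1 for s in dropped if s >= RELEVANCE_THRESHOLD)
    return {
        "useful_kept": useful_kept,
        "noise_kept": len(kept) - useful_kept,
        "useful_missed": useful_missed,
    }


for name, scores in [("factual", factual_query), ("broad", broad_query)]:
    print(name, fixed_top_k_report(scores, k=5))
```

With the same k=5, the factual query drags three noise chunks into the prompt, while the broad query loses two useful ones. No single constant serves both.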

How DRAG works

The idea of DRAG with KNEE is not just to find similar chunks, but to first treat documents as a hierarchy and then decide dynamically where to stop selecting. Instead of a hard limit, the algorithm analyzes the distribution of relevance scores and looks for an inflection point, the so-called knee, after which each added fragment provides less and less benefit. Everything in the long tail after this point can be cut without a noticeable loss of meaning.
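The summary does not say which knee-detection method the author uses, so the following is only one common way to do it, written as a sketch. It builds the cumulative relevance curve (total score after k chunks), draws a straight line from its first point to its last, and takes the point that rises farthest above that line as the knee. The function name knee_cutoff and the sample scores are assumptions, and scores are assumed to be non-negative and sorted best-first.

```python
from itertools import accumulate


def knee_cutoff(scores):
    """Return how many top-ranked chunks to keep.

    `scores` are non-negative relevance scores sorted in descending order.
    """
    n = len(scores)
    if n < 3:
        return n  # too few points for the curve to bend
    gain = list(accumulate(scores))  # cumulative relevance after each chunk
    slope = (gain[-1] - gain[0]) / (n - 1)  # slope of the first-to-last chord
    best_index, best_gap = n - 1, 0.0  # default: no knee found, keep everything
    for i, g in enumerate(gain):
        gap = g - (gain[0] + slope * i)  # height of the curve above the chord
        if gap > best_gap + 1e-9:
            best_index, best_gap = i, gap
    return best_index + 1


print(knee_cutoff([0.91, 0.88, 0.41, 0.39, 0.38, 0.36, 0.35, 0.33]))  # 2
print(knee_cutoff([0.86, 0.85, 0.83, 0.82, 0.80, 0.79, 0.77, 0.40]))  # 7
```

The same function returns two chunks for a query with a sharp early drop and seven for a query whose scores stay high for longer, which is the adaptive behavior the article describes. A flat score list has no knee, so the sketch keeps all candidates in that case.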

In practice, this looks like a more adaptive strategy for context extraction. The system is not obliged to return the same number of chunks for each query: in one case there will be three, in another ten, and in a third several related groups from different parts of the document. Because of this, RAG adapts better to the actual structure of knowledge, rather than to a pre-selected constant.

First, candidates are found by vector similarity
Then they are grouped and ordered by documents and levels
After that, the algorithm looks for the point where usefulness begins to drop sharply
Only the relevant core without the long tail of noise enters the final context
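The four steps above can be sketched in plain Python. This is one possible reading of them, not the author's implementation: the chunk fields (doc, section, pos, vector), the brute-force cosine search, and the helper names are hypothetical, and the cutoff is a simple largest-drop rule used as a stand-in for real knee detection.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def largest_drop_cutoff(scores):
    """Keep everything before the biggest gap between neighboring scores."""
    drops = [a - b for a, b in zip(scores, scores[1:])]
    if not drops or max(drops) == 0:
        return len(scores)  # nothing to cut
    return drops.index(max(drops)) + 1


def build_context(query_vec, chunks, max_candidates=20):
    # Step 1: candidates by vector similarity (brute force in this toy version).
    scored = sorted(
        ((cosine(query_vec, c["vector"]), c) for c in chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )[:max_candidates]

    # Step 2: group by document and section, then rank groups by their best hit.
    groups = defaultdict(list)
    for score, chunk in scored:
        groups[(chunk["doc"], chunk["section"])].append((score, chunk))
    ranked = sorted(groups.values(), key=lambda members: members[0][0], reverse=True)

    # Step 3: find the point where group relevance falls off.
    keep = largest_drop_cutoff([members[0][0] for members in ranked])

    # Step 4: emit only the surviving core, restored to reading order.
    core = [chunk for members in ranked[:keep] for _, chunk in members]
    core.sort(key=lambda c: (c["doc"], c["pos"]))
    return [c["text"] for c in core]


chunks = [
    {"doc": "manual.pdf", "section": "2 Install", "pos": 0, "text": "Install step 1", "vector": [1.0, 0.2]},
    {"doc": "manual.pdf", "section": "2 Install", "pos": 1, "text": "Install step 2", "vector": [1.0, 0.1]},
    {"doc": "manual.pdf", "section": "9 License", "pos": 7, "text": "License text", "vector": [0.2, 1.0]},
    {"doc": "report.pdf", "section": "1 Summary", "pos": 0, "text": "Quarterly summary", "vector": [0.1, 1.0]},
]
print(build_context([1.0, 0.0], chunks))  # ['Install step 1', 'Install step 2']
```

Cutting at the group level rather than the chunk level is a design choice made for this sketch. It keeps neighboring chunks from the same section together, at the cost of occasionally admitting a weaker chunk from a strong section.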

Such an approach is especially useful where knowledge lies not in a neat FAQ base, but in scattered instructions, reports, regulations, and large PDFs. In such corpora, distances between fragments by themselves tell you little if you do not take into account how these fragments are connected to each other and how quickly their value for answering falls. This is exactly where geometric analysis becomes not a mathematical ornament, but a practical filter.

Why Qdrant here

A separate strength of the article is that it does not drift into pure theory. The author shows how to build such a pipeline using Qdrant and Python, that is, on a familiar stack that is already used in many RAG projects. Qdrant is responsible for vector search and working with candidates, while the logic of DRAG with KNEE adds an adaptation layer on top of it: not just find something similar, but understand how much similar content you actually need to give to the model right now.
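As a rough illustration of that layering, the sketch below asks Qdrant for a generous candidate pool and lets a pluggable cutoff function decide how many results to keep. It assumes the qdrant-client package and its query_points method, and a collection that uses a similarity metric where higher scores are better (cosine or dot product). The function name adaptive_retrieve, the cutoff parameter, and the collection name in the usage comment are hypothetical, and the article's actual code may look quite different.

```python
def adaptive_retrieve(client, collection_name, query_vector, cutoff, max_candidates=50):
    """Fetch a wide candidate pool from Qdrant, then trim it adaptively.

    `client` is expected to be a qdrant_client.QdrantClient.
    `cutoff` is any function that maps a best-first score list to the number
    of results to keep (for example, a knee detector).
    """
    response = client.query_points(
        collection_name=collection_name,
        query=query_vector,
        limit=max_candidates,  # an upper bound only, not the final context size
        with_payload=True,
    )
    points = response.points  # scored points, returned best-first
    keep = cutoff([p.score for p in points])
    keep = max(0, min(len(points), keep))  # guard against out-of-range cutoffs
    return points[:keep]


# Usage sketch (needs `pip install qdrant-client` and a populated collection;
# `embed` and `my_knee_fn` are placeholders you would supply yourself):
#   from qdrant_client import QdrantClient
#   client = QdrantClient(url="http://localhost:6333")
#   hits = adaptive_retrieve(client, "docs", embed("How do I install it?"), cutoff=my_knee_fn)
#   context = "\n\n".join(hit.payload["text"] for hit in hits)
```

Here max_candidates plays the role that top_k used to play, but only as a ceiling: the number of chunks that actually reach the prompt is decided per query by the cutoff function.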

For teams that have already deployed standard retrieval and hit a wall in answer quality or inference costs, this is an important signal. The problem may lie not in the embeddings or the LLM itself, but in how you cut and feed context. Replacing a static top_k with a dynamic cutoff at the inflection point can reduce noise and improve accuracy at the same time, without completely rebuilding the architecture.

What this means

RAG is gradually moving away from coarse tuning where one parameter has to fit every case. The material on DRAG with KNEE shows a simple but important shift: the next level of quality depends not only on good search, but also on the ability to stop in time, so that the LLM gets enough context for an answer rather than an arbitrary overload of text.

