How five documents can break a RAG system and turn a knowledge base into an attack vector
RAG is considered a safe way to 'ground' LLMs on corporate documents, but the weakness is often hidden in the knowledge base itself. If a few poisoned…
AI-processed from Habr AI; edited by Hamidun News
RAG systems are often perceived as a way to reduce hallucinations and force LLMs to rely on corporate documents. But if the knowledge base is considered trusted by default, it can become the most convenient channel for prompt injection and subtle answer substitution.
Where the Weakness Lies
The problem isn't that the model "reads" documents poorly, but that it doesn't distinguish between facts and instructions the way humans do. If a knowledge base receives several specially prepared files, the RAG retrieval layer can consistently surface them in the context of relevant queries. Then the LLM sees the excerpts as part of its working environment and begins to follow hidden instructions: ignore the system prompt, change priorities, insert false conclusions, or steer the dialogue in a direction favorable to the attacker.
For a team, this is especially dangerous because the attack masquerades as normal knowledge base operation. A user asks a legitimate question, retrieval returns "relevant" chunks, and the answer appears confident and connected to the query. The logs can also seem normal: the model doesn't break, doesn't go into obvious jailbreak, and shows nothing suspicious.
But the quality of the solution drops, along with it—trust in the product, which was supposed to rely on verified documents.
Why Five Documents Are Enough
The key risk is that RAG security is often overestimated because of embeddings. It seems like vector search transforms source texts into safe mathematical abstraction, but it doesn't. Embeddings help find similar fragments, not neutralize their meaning.
If five documents are written to match popular user queries and contain malicious instructions in the right places, the system will repeatedly include them in the context. The attack doesn't require full control of the knowledge base: sometimes a few notes, FAQs, or internal policies that end up in the index without verification are enough. The effect is amplified by the retrieval mechanics themselves.
The system rarely feeds the entire document to the model—it usually slices it into chunks and selects the top matches. This means the attacker doesn't have to write a long malicious text: short but semantically "sticky" fragments that pop up in the top-k results are enough. As a result, the LLM receives not a neutral reference, but a pre-selected set of influential prompts, and the system operator may not notice for a long time that answers are drifting in the direction set by these fragments.
What Needs Protection
In production, RAG can't be protected by a single filter at the input. You need a multi-layered scheme that checks documents, extracted chunks, and the model's final answer. Otherwise, a team can clean up the user's query but let the same injection slip through in the knowledge base. A separate problem is "silent" attacks, where the system doesn't crash or show an obvious error—it simply starts confidently advising wrong actions, substituting priorities, or revealing what it shouldn't.
- Document verification before indexing for hidden instructions and suspicious patterns
- Data isolation by source, role, and trust level
- Retrieval policies: limits on single-source dominance and control of diversity
- Context filtering before feeding to the LLM and separate guardrails for the response
- Logs, red-team tests, and regular corpus reassessment after updates
Demo scenarios usually hide this problem because the corpus is small, sources are known in advance, and queries are predictable. In a working system, everything is different: documents are loaded in batches, updated without manual moderation, come from different departments, and often mix facts, advice, templates, and service instructions. In such an environment, RAG should be designed not as "search + LLM," but as a security-sensitive pipeline with clear trust zones, change audits, and separate rules for different content types.
What This Means
The main vulnerability in RAG lies not only in the model, but in the trust placed in the context supplied to it by the infrastructure. If the system works with real business data, protection should begin long before answer generation: at the document upload, retrieval, and post-processing stages. Otherwise, even a small set of poisoned files can systematically distort the result.
Want to stop reading about AI and start using it?
AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.