RAG systems break on real data: the culprit is retrieval, not the model
RAG pipelines often break not because of the model, but because of retrieval. When the system retrieves the wrong document or text chunk, even GPT-4 starts hallucinating.

A RAG pipeline looks like magic: load documents, chunk them, generate embeddings, connect a vector database. Ask a question, and the model answers confidently and in detail. Show it to the client, and they're impressed. But then real testing on actual questions begins, and it turns out the system misses half of them.
RAG's Bottleneck
On real questions, the system often answers incorrectly. It finds the wrong document entirely, or finds the right document but extracts the wrong text chunk, or retrieves nothing relevant at all, and the model confidently hallucinates. It seems the problem is with the model. In reality, retrieval is at fault.
GPT-4 and Claude answer well when given the right context. If the context is wrong, hallucination is all but guaranteed, no matter how good the model is. The model answers only as well as the context provided to it.
The problem is not in the models. The problem is in retrieval: how the system searches your database for relevant document chunks. This is the bottleneck through which the entire RAG pipeline passes. If retrieval gives the model the wrong context, everything else is wasted time and money.
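As a rough illustration of what the retrieval step does, the sketch below ranks chunks by cosine similarity between a question vector and chunk vectors and keeps the top k. The three-dimensional vectors, the chunk ids, and the `retrieve` helper are made up for the example; a real pipeline would get its vectors from an embedding model and search them in a vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, k=5):
    """Return the ids of the k chunks most similar to the query.

    `index` maps chunk id -> embedding vector (toy data in this sketch).
    """
    ranked = sorted(
        index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Toy "embeddings"; real ones have hundreds of dimensions.
index = {
    "returns_policy": [0.9, 0.1, 0.0],
    "size_table": [0.1, 0.9, 0.1],
    "shipping": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]  # hypothetical vector for a returns question

top = retrieve(query, index, k=2)
print(top)  # ['returns_policy', 'size_table']
```

Whatever this step returns is all the model gets to see, which is why a bad ranking here cannot be repaired later in the pipeline.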
"The model answers only as well as the context provided to it."
When Retrieval Breaks
Retrieval can break for dozens of reasons. Here are the most common:
- Chunks are too large or too small. A 512-word chunk can pull in neighboring context instead of the needed passage: the question is about the return policy, but the retrieved chunk mostly describes a size table.
- The embeddings were generated from English-language data, but the question arrives in Russian. Unless the embedding model is multilingual, the question vector lands far from the document vectors and nothing matches.
- The question is phrased differently from the documents. The user asks about "return order" while the documents say "product return". The meaning is the same, but the wording differs enough that the question embedding may not land near the document embeddings.
- The relevant document is ranked 8th out of 10, but only the top 5 results are passed to the model. The needed context never makes it into the prompt.
- The index is full of duplicates and noise. Irrelevant chunks push the correct information out of the results.
Each of these problems leads to the same result: the model hallucinates instead of providing the correct answer.
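Two of the failure modes above, chunk size and duplicate noise, are decided before anything reaches the index. Here is a minimal sketch, assuming word-based chunks with overlap and removal of exact duplicates after normalization. The `chunk_words` and `dedupe_chunks` helpers and their default sizes are illustrative choices, not recommended settings.

```python
import hashlib

def chunk_words(text, size=200, overlap=50):
    """Split text into chunks of `size` words that share `overlap` words
    with the previous chunk, so a passage cut at a boundary still appears
    whole in at least one chunk (as long as it is shorter than the overlap)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def dedupe_chunks(chunks):
    """Drop chunks that are identical after lowercasing and collapsing
    whitespace. Near-duplicates are NOT detected by this simple approach."""
    seen = set()
    unique = []
    for chunk in chunks:
        normalized = " ".join(chunk.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(text, size=4, overlap=1)
print(chunks)  # ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']

unique = dedupe_chunks(["Return policy", "return  policy", "Size table"])
print(unique)  # ['Return policy', 'Size table']
```

The right chunk size and overlap depend on the documents, so they are worth tuning against real questions rather than fixing up front.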
The Cost of Retrieval
Optimizing retrieval is not a hobby for enthusiasts. It's real work. A week or two of analysis by a developer can reveal that the system is 30-40% below expected accuracy, and that the reason is not the model but retrieval searching incorrectly. On real projects, this is a huge loss: time spent on building RAG, money spent on infrastructure, and a system that still doesn't work because document vectors don't match question vectors.
What This Means
RAG works only if retrieval works. Without it, even the best model will be wrong. This means that before launching RAG in production, you need to invest serious time in optimizing search, testing on real data, and iteratively improving the retrieval pipeline.
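One simple way to test retrieval on real data is to label a set of real questions with the chunk that answers each one, then measure how often that chunk appears in the top k results. The sketch below assumes such a labeled set exists. The questions, chunk ids, and the `hit_rate_at_k` helper are hypothetical.

```python
def hit_rate_at_k(results, relevant, k):
    """Fraction of questions whose correct chunk appears in the top-k results.

    results:  question -> ranked list of chunk ids returned by retrieval
    relevant: question -> id of the chunk a human marked as correct
    """
    if not relevant:
        return 0.0
    hits = sum(
        1
        for question, gold in relevant.items()
        if gold in results.get(question, [])[:k]
    )
    return hits / len(relevant)

# Hypothetical retrieval output for two labeled questions.
results = {
    "How do I return an order?": [
        "c3", "c7", "c1", "c9", "c2", "c5", "c4", "returns", "c6", "c8",
    ],  # the correct chunk is ranked 8th
    "What sizes are available?": ["sizes", "c1", "c4", "c2", "c9"],
}
relevant = {
    "How do I return an order?": "returns",
    "What sizes are available?": "sizes",
}

print(hit_rate_at_k(results, relevant, k=5))   # 0.5
print(hit_rate_at_k(results, relevant, k=10))  # 1.0
```

Running the same measurement after each change to chunking, embeddings, or the top-k cutoff shows whether the change actually helped on the questions that matter.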