Gramax shows how to compare RAG answer quality without eyeballing


Originally published on Habr AI. Hamidun News processes and adapts the material with AI.


Apr 28, 2026. Reading time: 3 min.


Hamidun News Editorial


Gramax has described a practical transition from subjective evaluation of RAG systems to reproducible answer comparison. The team proposes focusing not on how good the retrieval metrics look, but on whether users receive accurate, complete, and understandable answers from the knowledge base.

The typical problem with almost any RAG search over documentation or an internal knowledge base is that finding relevant chunks well does not guarantee a good final answer. The user does not see DCG, Recall@10, reranking, or other internal indicators; they see only the final text. This is the level where the main failures emerge: the model may ignore part of the retrieved context, answer in the wrong language, add unverified details, or produce confident but hard-to-read text.
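To make the retrieval-side numbers concrete, here is a minimal Python sketch of Recall@k and DCG. The chunk IDs and relevance labels are hypothetical and are not taken from the Gramax material; the point is only to show what these metrics measure.

```python
import math

def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Share of the relevant chunks that appear among the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)

def dcg_at_k(relevance, k=10):
    """Discounted cumulative gain: each relevance score is divided
    by log2(rank + 1), so hits lower in the list count for less."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance[:k], start=1))

# Hypothetical retrieval result for one question.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4"}
graded = [1 if cid in relevant else 0 for cid in retrieved]

recall = recall_at_k(retrieved, relevant)
dcg = dcg_at_k(graded)
```

Here both relevant chunks are retrieved, so the recall is perfect, yet nothing in either number says whether the generated answer actually used those chunks, stayed in the user's language, or avoided invented details.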

Gramax notes that it has already worked on the retrieval layer: selecting chunking schemes, adding metadata, combining different search types, and reranking results. These techniques do increase the chances of extracting the necessary fragments from the knowledge base. But once search is stable, the next question arises: how do you know the entire chain works for the end user, not just for the engineer watching the technical dashboard? In practice, this gap between search quality and answer quality often causes false optimism in RAG development.

The key idea is that evaluation should be tied to user scenarios. Someone asking a documentation question cares not about a list of successfully retrieved chunks but about the answer itself: is the necessary fact there, has an important nuance been lost, are there hallucinations, was the language of the request respected, and can the wording be trusted? This shift in focus requires building quality verification differently. Instead of evaluating "by eye", the team proposes fixing a set of criteria and comparing models and configurations on the same set of questions. This is especially important when differences are subtle and subjective impressions easily distort the overall picture.
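One way to picture a fixed set of criteria applied to a shared question set is the sketch below. It is an illustration under assumptions, not Gramax's implementation: the criterion names, the `verdict` helper, and the configuration labels are hypothetical, and the pass/fail verdicts are assumed to come from a reviewer or an automated judge that the article does not specify.

```python
from collections import defaultdict

# Hypothetical criterion names, loosely following the questions in the text.
CRITERIA = ("fact_present", "complete", "no_hallucination", "language_respected")

def pass_rates(verdicts):
    """Return {config: {criterion: share of questions passed}}.

    Raises ValueError if configurations were scored on different
    question sets, because the comparison would not be like for like.
    """
    by_config = defaultdict(dict)
    for v in verdicts:
        by_config[v["config"]][v["question_id"]] = v
    question_sets = {frozenset(answers) for answers in by_config.values()}
    if len(question_sets) > 1:
        raise ValueError("configurations were scored on different question sets")
    return {
        config: {
            c: sum(bool(v[c]) for v in answers.values()) / len(answers)
            for c in CRITERIA
        }
        for config, answers in by_config.items()
    }

def verdict(config, question_id, *flags):
    """Build one verdict record; flags follow the order of CRITERIA."""
    return {"config": config, "question_id": question_id,
            **dict(zip(CRITERIA, flags))}

# Hypothetical verdicts for two configurations on the same two questions.
verdicts = [
    verdict("model_a", "q1", True, True, True, True),
    verdict("model_a", "q2", True, False, True, True),
    verdict("model_b", "q1", True, True, False, True),
    verdict("model_b", "q2", True, True, True, True),
]
rates = pass_rates(verdicts)
```

Keeping the criteria as separate pass rates rather than one blended score makes it visible that, in this made-up data, one configuration loses completeness while the other hallucinates, which a single average would hide.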

A separate practical conclusion concerns model comparison. Gramax emphasizes that for RAG tasks it is not enough to rely on general benchmarks or a model's market reputation. The same model can be strong at free-form generation but weaker at the discipline of answering strictly from the retrieved context. Comparison should therefore be done in an applied setting: on your own questions, your own knowledge base, and with clear validation rules. This shows which model retains facts better, does not drift into invention, handles language correctly, and answers similar queries consistently.
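The point about answering similar queries consistently can be checked with a small sketch like the following. It assumes questions are grouped into paraphrase sets with one expected fact per set, and it uses a case-insensitive substring match as a deliberately crude stand-in for a real validation rule. All names and data are hypothetical.

```python
def consistency(answers_by_group, expected_fact_by_group):
    """Share of paraphrase groups in which every answer contains
    the expected fact (case-insensitive substring match)."""
    if not answers_by_group:
        return 0.0
    consistent = 0
    for group_id, answers in answers_by_group.items():
        fact = expected_fact_by_group[group_id].lower()
        if all(fact in answer.lower() for answer in answers):
            consistent += 1
    return consistent / len(answers_by_group)

# Hypothetical answers of one model to paraphrased questions.
answers_by_group = {
    "export": ["You can export to PDF from the menu.",
               "Export supports PDF."],
    "limit": ["The upload limit is 10 MB.",
              "There is no upload limit."],
}
expected_fact_by_group = {"export": "pdf", "limit": "10 mb"}

score = consistency(answers_by_group, expected_fact_by_group)
```

In this made-up example the model is stable on the export question but contradicts itself on the upload limit, so only one of the two groups counts as consistent.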

For the market, this is an important signal. RAG projects are increasingly deployed in support, internal guides, regulatory databases, and product documentation, where an error in an answer costs more than a drop in an abstract search metric. The approach Gramax describes effectively moves the quality conversation from the engineering plane to the product plane: a good system is one that consistently delivers useful and verifiable answers to users, not one that looks good in retrieval reports.

The sooner teams start measuring this level, the sooner they will stop confusing retrieved context with user tasks that were actually solved. This means the next stage of RAG evolution will depend not only on better search but also on treating answer evaluation as a product in its own right. For teams that have already configured chunking, hybrid search, and reranking, this methodology may be the main way to understand which combination of models and prompts actually works in production.

