MWS AI and SberAI Propose DRAGOn — a Dynamic Benchmark for Evaluating RAG Systems
AI-processed from Habr AI; edited by Hamidun News
Researchers from MWS AI, SberAI, and several universities have presented DRAGOn, a dynamic benchmark for evaluating RAG systems that operate on a regularly updated corpus. The work was published on arXiv in July 2025 and was included in the EACL 2026 proceedings in March 2026. It offers a practical way to test RAG on genuinely new data rather than on a long-fixed set of questions.
Why this is difficult
RAG evaluation almost always runs into the same problem: tests go stale faster than the systems they measure. If a benchmark is built on a fixed corpus, a model can score well not because it retrieves and links documents well, but because it already saw some of the facts during training. A second complication is that the final quality score makes it hard to separate the retriever's contribution from the generator's. Finally, preparing question-answer pairs by hand for continuous evaluation is expensive, slow, and scales poorly for teams that want to compare new versions of their pipelines regularly.
How DRAGOn is structured
The authors propose building the benchmark as a pipeline. Parsers regularly pull material from news sources, and a separate module extracts atomic facts from the texts as subject-relation-object triples. The system then checks the entities against Wikidata and discards facts that are already known, so the sample contains only new knowledge. Questions of varying complexity are generated automatically from the resulting graph, and the benchmark can be re-released regularly, without manual reassembly and with clear version control. The question types are:
- Simple — a question about a single fact
- Set — an enumeration of several objects with a common relation
- Multi-hop — a question through an intermediate entity
- Conditional — an answer based on two conditions simultaneously
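The steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy triples, the `KNOWN` set standing in for the Wikidata check, the function names `keep_novel` and `build_questions`, and the tuple-based query format are all assumptions made for illustration. It only shows how the four question types could be derived from a small graph of novel triples.

```python
from collections import defaultdict

# Toy triples in (subject, relation, object) form; all names are invented.
EXTRACTED = [
    ("Acme", "founded_in", "1999"),
    ("Acme", "acquired", "Beta Labs"),
    ("Acme", "acquired", "Gamma Soft"),
    ("Acme", "led_by", "Ann Lee"),
    ("Ann Lee", "born_in", "Oslo"),
    ("Beta Labs", "based_in", "Oslo"),
    ("Gamma Soft", "based_in", "Lisbon"),
]
# Assumption: a plain set stands in for the knowledge-base lookup.
KNOWN = {("Acme", "founded_in", "1999")}


def keep_novel(triples, known):
    """Drop triples that the reference knowledge base already contains."""
    return [t for t in triples if t not in known]


def build_questions(triples):
    """Return (kind, query, answers) tuples for the four question types."""
    objects = defaultdict(set)  # (subject, relation) -> set of objects
    for s, r, o in triples:
        objects[(s, r)].add(o)

    out = []
    # Simple: one object for the relation. Set: several objects.
    for (s, r), objs in objects.items():
        kind = "simple" if len(objs) == 1 else "set"
        out.append((kind, (s, r, "?"), sorted(objs)))

    # Multi-hop: subject -> unique intermediate entity -> unique answer.
    for (s, r1), mids in objects.items():
        if len(mids) != 1:
            continue  # an ambiguous intermediate would make the question unclear
        (mid,) = mids
        for (s2, r2), objs in objects.items():
            if s2 == mid and len(objs) == 1:
                out.append(("multi_hop", (s, r1, r2, "?"), sorted(objs)))

    # Conditional: exactly one of several candidates satisfies a second condition.
    for (s, r1), candidates in objects.items():
        for e, r2, o in triples:
            if e not in candidates:
                continue
            matches = [c for c in candidates if o in objects.get((c, r2), ())]
            if len(candidates) > 1 and matches == [e]:
                out.append(("conditional", ((s, r1, "?"), ("?", r2, o)), [e]))
    return out


if __name__ == "__main__":
    for question in build_questions(keep_novel(EXTRACTED, KNOWN)):
        print(question)
```

A real pipeline would render each query tuple as natural-language text, which is why the later filtering stages matter.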
On top of this, the authors added a public leaderboard and split the evaluation into public and private parts. The public part supports open comparison of results, while the private part allows precise verification against the gold standard and protects against fitting to known answers. This format makes comparing different RAG configurations fairer: one team can test a new retriever, another a new generator, and both get comparable results on a fresh corpus rather than on a set the model may already have learned.
How verification works
To keep automatically generated QA pairs from turning into noise, DRAGOn runs them through several filters. First, basic linguistic correctness is checked with RuRoBERTa-large, then the questions go through NER verification via Natasha. After that, overly simple examples are removed from the set: if small models such as Qwen 2.5 7B or Llama 3 8B can answer without relying on context, the question is not suitable for a fair RAG evaluation and is excluded from the final version. Final quality control is done by POLLUX 7B in LLM-as-a-Judge mode. The model rates grammaticality, naturalness, correctness, and the question's dependence on context, and these scores are then checked against human annotation.
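The "too easy" filter can be sketched as follows. This is an illustrative sketch under stated assumptions, not the benchmark's code: the small models are represented by plain Python callables (stubs here), the helper names are hypothetical, and correctness is judged by normalized exact match, which the article does not specify.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def is_context_dependent(question, gold, closed_book_models):
    """True if no model produces the gold answer without retrieved context."""
    return not any(
        normalize(model(question)) == normalize(gold)
        for model in closed_book_models
    )


def filter_questions(qa_pairs, closed_book_models):
    """Keep only QA pairs that the closed-book models fail to answer."""
    return [
        (q, a) for q, a in qa_pairs
        if is_context_dependent(q, a, closed_book_models)
    ]


def make_stub_model(memory):
    """Hypothetical stand-in for a small LLM queried without any context."""
    return lambda question: memory.get(question, "I don't know")


if __name__ == "__main__":
    qa_pairs = [
        ("What is the capital of France?", "Paris"),
        ("Who leads Acme?", "Ann Lee"),
    ]
    models = [make_stub_model({"What is the capital of France?": "paris "})]
    print(filter_questions(qa_pairs, models))
```

In practice each callable would wrap a real model call, and a softer match than exact string equality may be needed for free-form answers.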
In an experiment with 532 examples, the automatic judge showed high accuracy, though it turned out to be quite strict. After filtering, the authors keep 150 high-quality questions per category, then test systems separately for retrieval and generation. In the tests, combinations with Qwen 3 Embedding 8B and E5 Mistral 7B Instruct looked strongest. The conclusion is simple: if the retriever finds the correct context, the generator has a much easier time giving an accurate answer.
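One way to score retrieval and generation separately is sketched below. The metrics are assumptions chosen for illustration (hit rate at k for the retriever, exact match for the generator); the article does not say which metrics DRAGOn reports, and the record fields `retrieved`, `relevant`, `answer`, and `gold` are hypothetical.

```python
def hit_at_k(retrieved_ids, relevant_ids, k):
    """True if any relevant document appears among the top-k retrieved ones."""
    return any(doc in relevant_ids for doc in retrieved_ids[:k])


def evaluate(examples, k=3):
    """Report retrieval and generation quality, plus accuracy split by retrieval outcome."""
    hits = [hit_at_k(ex["retrieved"], ex["relevant"], k) for ex in examples]
    correct = [
        ex["answer"].strip().lower() == ex["gold"].strip().lower()
        for ex in examples
    ]
    n = len(examples)
    n_hit = sum(hits)
    n_miss = n - n_hit
    return {
        "hit_rate": n_hit / n,
        "exact_match": sum(correct) / n,
        # Generator accuracy when the retriever did / did not find the context.
        "em_given_hit": sum(c for c, h in zip(correct, hits) if h) / max(n_hit, 1),
        "em_given_miss": sum(c for c, h in zip(correct, hits) if not h) / max(n_miss, 1),
    }


if __name__ == "__main__":
    examples = [
        {"retrieved": ["d1", "d7"], "relevant": {"d1"}, "answer": "Oslo", "gold": "Oslo"},
        {"retrieved": ["d4", "d2"], "relevant": {"d2"}, "answer": "ann lee", "gold": "Ann Lee"},
        {"retrieved": ["d3", "d9"], "relevant": {"d3"}, "answer": "Rome", "gold": "Lisbon"},
        {"retrieved": ["d5", "d6"], "relevant": {"d8"}, "answer": "unknown", "gold": "Beta Labs"},
    ]
    print(evaluate(examples, k=2))
```

Comparing `em_given_hit` with `em_given_miss` is a simple way to see how much of the final answer quality depends on the retriever finding the right context.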
What this means
DRAGOn is an attempt to turn RAG evaluation from a one-time demonstration into a continuously updated process. For teams building search over documents, news, or internal knowledge bases, this approach is useful because it reduces the risk of false confidence: a system can give polished answers on familiar data yet fail on truly new facts. A dynamic benchmark helps catch that gap earlier and gives a more honest picture of how ready a RAG system is for a live environment.