OpenAI and Grok Lose to Custom RAG in Legal Agentic RAG Challenge
In the Legal Agentic RAG Challenge, a team compared ready-made solutions from OpenAI and Grok, CAG, BM25, and a custom hybrid pipeline. The finding: even…
AI-processed from Habr AI; edited by Hamidun News
The article's authors describe how their team competed in the Agentic RAG Legal Challenge, an international competition on answering questions from legal PDFs. Their main conclusion is soberingly practical: victory is determined not by a big-name model but by the ability to tie each answer to the correct page of a document.
How Systems Were Evaluated
More than 300 teams took part in the challenge, and the corpus consisted of real court decisions, laws, and regulations of the DIFC (Dubai International Financial Centre), all in English. Participants first received 30 documents and 100 questions for the warm-up, then almost 300 documents and 900 questions for the finals. The questions came in several types: dates, numbers, names, lists, yes/no, and short free-form answers. More important than the exact wording of an answer, however, was the Grounding metric: whether the cited pages matched the pages where the answer was actually found.
"Even a perfect answer becomes zero if you indicated the wrong page."
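The exact scoring formula is not given here, so the following is only a minimal sketch of how page-level grounding could wipe out an otherwise correct answer. It assumes grounding is an F1 overlap between cited and gold (document, page) pairs and that it multiplies answer correctness; the function names `page_grounding_f1` and `question_score` are hypothetical.

```python
def page_grounding_f1(cited_pages, gold_pages):
    """F1 overlap between cited and gold (doc_id, page) pairs.

    Assumed metric for illustration, not the challenge's official formula.
    """
    cited, gold = set(cited_pages), set(gold_pages)
    hits = len(cited & gold)
    if hits == 0:
        # No cited page matches a gold page (or one side is empty).
        return 0.0
    precision = hits / len(cited)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def question_score(answer_correct, cited_pages, gold_pages):
    # Hypothetical combination: correctness is multiplied by grounding,
    # so a correct answer with a wrong page still scores zero.
    return float(answer_correct) * page_grounding_f1(cited_pages, gold_pages)
```

Under these assumptions, a correct answer citing page 3 when the evidence is on page 4 scores 0.0, while citing one extra wrong page alongside the right one only lowers the score.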
Based on this, the team built its own pipeline: PDFs were converted to Markdown, split into semantic chunks, context was added for each chunk, dense and sparse embeddings were computed, and everything was stored in Qdrant. Part of the work was performed locally on a Mac Studio M3 Ultra. The machine quickly handled parsing 30 PDFs and local embeddings, but generating context for chunks proved too slow: due to a long prefill, each chunk took 15–20 seconds, so this stage had to be moved to an external API.
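Because the score depends on page citations, each chunk has to carry its source page through the whole pipeline. The sketch below illustrates that idea only: it uses plain paragraph splitting as a stand-in for the team's semantic chunking, and the `Chunk` type, `chunk_pages` function, and `max_words` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    page: int   # page number kept so the answer can be cited later
    text: str


def chunk_pages(doc_id, pages, max_words=200):
    """Split per-page Markdown into chunks that never cross a page boundary.

    pages: list of (page_number, markdown_text) tuples.
    Paragraph splitting is an assumption; a single paragraph longer than
    max_words is kept whole rather than cut.
    """
    chunks = []
    for page_number, text in pages:
        buffer, count = [], 0
        for paragraph in text.split("\n\n"):
            words = paragraph.split()
            if not words:
                continue
            # Flush the buffer before it would exceed the word budget.
            if buffer and count + len(words) > max_words:
                chunks.append(Chunk(doc_id, page_number, "\n\n".join(buffer)))
                buffer, count = [], 0
            buffer.append(paragraph.strip())
            count += len(words)
        if buffer:
            chunks.append(Chunk(doc_id, page_number, "\n\n".join(buffer)))
    return chunks
```

Keeping chunks inside a single page trades some context for an unambiguous page reference on every retrieved chunk.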
Who Failed First
The authors first tested the laziest approach: simply uploading the documents to OpenAI's built-in knowledge base. The solution looked decent on paper but scored a Total of only 0.362. Answers were often good, but the page citations ruined the result. They then tested CAG, where almost the entire corpus is sent to the model at once, with no chunk-level search. An experiment with Qwen 3.5 Flash and a context of up to 1 million tokens showed that CAG is not useless: accuracy was high, but Grounding let it down again. Plain BM25 performed even worse and turned out to be the weakest attempt. Several unpleasant but useful conclusions emerged from these runs:
- built-in knowledge bases from major players do not guarantee good citation;
- CAG can answer accurately, but without careful page grounding loses on the final score;
- classic BM25 alone can no longer handle complex legal questions;
- hybrid RAG with proper reranking proved stronger than OpenAI and Grok's built-in solutions.
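How the dense and sparse results were combined is not described here. One common way to merge two ranked lists is reciprocal rank fusion (RRF), shown below as an assumed stand-in rather than the team's actual method; the function name and the constant `k=60` are illustrative choices.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one ranking.

    rankings: list of lists, each ordered best-first (e.g. dense, sparse).
    Each appearance contributes 1 / (k + rank); ties break by chunk id.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c1", "c2", "c3"]    # hypothetical dense-search results
sparse_hits = ["c3", "c1", "c4"]   # hypothetical sparse-search results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Chunks found by both retrievers rise to the top of `fused`, and a reranker can then be applied to that shortlist.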
Their own MORAG system did not impress at first either: a small local Qwen model struggled to maintain accuracy and collapsed in particular on multi-document questions. The breakthrough came after switching to Grok via OpenRouter and applying stricter chunk selection. In the warm-up, the team rose from a Total of 0.362 in early runs to 0.780 in the final attempt, while Grounding grew from approximately 0.45 to 0.90. This gain in Grounding, rather than swapping one trendy model for another, was the main driver of progress.
What Really Helped
The most significant gains came not from abstract "quality improvements" but from several very concrete engineering decisions. The team split reasoning and non-reasoning modes by question type, added an agentic loop that repeats the search when the retrieved data is insufficient, and separately built a gold set to verify answers across the 900 questions. This let them avoid flying blind in the finals and quickly find systematic errors, such as misreading wording about an appeal that was filed but rejected.
- reasoning models were kept for the boolean, name, and names question types, where the non-reasoning mode lost 8–16% accuracy;
- for date, number, and free_text, they used the faster non-reasoning mode without notable loss;
- they added the first 1–3 pages of documents mentioned in the question to the search, because key case details often lie there;
- they rebuilt summaries and sparse vectors for the legal domain;
- they kept chunks within the limit of the FRIDA embedder, which truncates anything longer than 512 tokens.
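The mode routing and the repeated-search loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: `search` and `generate` are hypothetical callables supplied by the caller, and the `sufficient` and `follow_up_query` fields are an invented contract, not the team's actual interface.

```python
REASONING_TYPES = {"boolean", "name", "names"}


def pick_mode(answer_type):
    # Reasoning mode only for the types where it reportedly paid off.
    return "reasoning" if answer_type in REASONING_TYPES else "non_reasoning"


def answer_with_retries(question, answer_type, search, generate, max_rounds=3):
    """Search, try to answer, and search again if the context is insufficient.

    search(query) -> list of chunks (hypothetical retriever).
    generate(question, context, mode) -> dict with a "sufficient" flag and,
    optionally, a "follow_up_query" (hypothetical LLM wrapper).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    mode = pick_mode(answer_type)
    query, context, result = question, [], None
    for _ in range(max_rounds):
        # Accumulate new chunks across rounds without duplicates.
        for chunk in search(query):
            if chunk not in context:
                context.append(chunk)
        result = generate(question, context, mode)
        if result["sufficient"]:
            break
        query = result.get("follow_up_query") or question
    return result
```

Capping the number of rounds keeps latency bounded, which matters because speed was one of the scored metrics.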
In the final phase, MORAG fell short of the prepared golden submission on overall Total (0.603 versus 0.631) but outperformed it on three of five metrics: accuracy of deterministic answers, quality of free-form answers, and speed. Once again, the loss came from Grounding. This is an important nuance: the RAG system itself was already answering better than the "manual" baseline, but tying each answer to the correct page still lagged.
What This Means
This story shows that CAG did not kill RAG, that a Mac Studio is suitable for parts of a local pipeline, and that ready-made knowledge bases from OpenAI and Grok do not replace tuning to a specific corpus. When the data is complex, victory goes not to the loudest brand but to the team that knows how to measure errors, control chunking, and get Grounding into working shape.