Agentic Legal RAG Challenge 2026: How Sparks of intelligence Tested the Limits of Agentic RAG

Q: What is the source?

Originally published on Habr AI. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Apr 30, 2026. Reading time: 3 min.

Hamidun News Editorial


The Sparks of intelligence team published a detailed breakdown of their participation in the Agentic Legal RAG Challenge 2026, an international hackathon focused on legal RAG. It is not a story of a resounding victory but a useful engineering report on why document search systems tend to fail at context preparation rather than at model selection.

How the hackathon was organized

The competition was run by EORA AI Applications and Services. Participants had to build a system that answers questions about documents from the Dubai International Financial Centre (DIFC) courts. The hackathon had two stages: from March 11 to 19, 2026, participants worked with 30 documents and 100 questions, and in the finals, held March 20 to 22, 2026, the volume grew to 300 documents and 900 questions.

The prize fund was $32,000, and more than 300 people participated in the competition. The difficulty wasn't only in volume. The organizers deliberately incorporated different answer types: boolean, name, date, number, and free text.

In other words, a generation model alone wasn't enough: the system had to extract facts precisely, keep context, and stay within time and token budgets. Free-text answers were scored by an LLM, and the key criteria were accuracy, speed, and processing cost. In essence, participants were tested not on the ability to "plug in a chatbot" but on the maturity of the entire retrieval loop.
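To make the answer types concrete, here is a minimal Python sketch of how a raw model reply might be coerced into a typed answer before scoring. The function name `normalize_answer`, the type labels, and the accepted date formats are illustrative assumptions; the article does not describe the organizers' actual answer format.

```python
import re
from datetime import datetime

# Hypothetical labels for the five answer types named in the article.
ANSWER_TYPES = ("boolean", "name", "date", "number", "free_text")


def normalize_answer(raw: str, answer_type: str):
    """Coerce a raw model reply into a value matching its answer type."""
    text = raw.strip()
    if answer_type == "boolean":
        lowered = text.lower()
        if lowered in ("yes", "true"):
            return True
        if lowered in ("no", "false"):
            return False
        raise ValueError(f"not a boolean answer: {raw!r}")
    if answer_type == "number":
        # Take the first number in the reply, allowing thousands separators.
        match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
        if match is None:
            raise ValueError(f"no number found in: {raw!r}")
        value = float(match.group().replace(",", ""))
        return int(value) if value.is_integer() else value
    if answer_type == "date":
        # Assumed input formats; a real system would need a wider list.
        for fmt in ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y"):
            try:
                return datetime.strptime(text, fmt).date()
            except ValueError:
                continue
        raise ValueError(f"unrecognized date format: {raw!r}")
    if answer_type in ("name", "free_text"):
        # Collapse runs of whitespace; leave the wording untouched.
        return " ".join(text.split())
    raise ValueError(f"unknown answer type: {answer_type!r}")
```

Failing loudly on an unparseable reply, rather than returning a guess, keeps extraction errors visible during evaluation.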

Two versions of the system

The team built two architectures on a single stack: Qdrant as the vector database, LlamaIndex for working with indexes and LLM abstractions, and Unstructured for extracting text from PDFs while preserving structure. After that, the paths diverged.

The first version was maximally practical: chunking by pages with overlap, hybrid search, filtering by metadata and regular expressions. The second version was notably more ambitious: hierarchical chunking, preliminary structure analysis via LLM, and an agent router that selects the appropriate search tool for a specific question.
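As an illustration of page-based chunking with overlap, the sketch below keeps one chunk per page and prepends the tail of the previous page. It is a simplified stand-in written in plain Python, not the team's code: the `PageChunk` class, the `chunk_by_pages` function, and the character-based overlap are assumptions.

```python
from dataclasses import dataclass


@dataclass
class PageChunk:
    doc_id: str
    page: int  # 1-based page number, kept so an answer can cite its source
    text: str


def chunk_by_pages(doc_id: str, pages: list[str], overlap_chars: int = 200) -> list[PageChunk]:
    """Build one chunk per page, prefixed with the tail of the previous page."""
    chunks: list[PageChunk] = []
    for index, page_text in enumerate(pages):
        prefix = ""
        if index > 0 and overlap_chars > 0:
            # The overlap keeps sentences that cross a page break searchable.
            prefix = pages[index - 1][-overlap_chars:]
        chunks.append(PageChunk(doc_id, index + 1, prefix + page_text))
    return chunks
```

Because every chunk carries its page number, the page that produced an answer can be reported directly, which is one plausible reading of the "clear grounding" the simple version offered.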

The simple version divided documents by pages and immediately provided clear grounding.
Search there was built on a mix of vectors, metadata, and regex filters.
The agent version used a router and four tools: metadata search, exact match, document comparison, and hybrid search.
Both schemes applied a reranker to reorder the top-k candidates and boost relevance.
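The article does not say how the vector and keyword results were combined. One common option is reciprocal rank fusion (RRF), sketched here in plain Python together with a metadata filter. The function names, the constant `k = 60`, and the metadata layout are assumptions for illustration only.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranking with RRF."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score first; ties are broken by id to keep output stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def hybrid_search(
    vector_hits: list[str],
    keyword_hits: list[str],
    metadata: dict[str, dict[str, str]],
    required: dict[str, str],
    top_k: int = 5,
) -> list[str]:
    """Fuse two rankings, then keep only chunks whose metadata matches."""
    fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
    allowed = [
        cid
        for cid in fused
        if all(metadata.get(cid, {}).get(key) == value for key, value in required.items())
    ]
    return allowed[:top_k]
```

In a full pipeline a reranker would then reorder the returned candidates; that step is omitted here because it depends on an external model.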

In practice, the simple architecture proved more robust. It could be assembled quickly, behavior was predictable, and the source of answers was easier to trace. The agent scheme looked stronger on paper but turned out to be more expensive in time: two LLM calls, unstable chunking, and more points of failure. Even after fixing some errors, the team didn't manage to fully run through and tune the entire pipeline. For a hackathon with a hard deadline, this is critical: extra complexity quickly eats up the advantage of a "smart" architecture.
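The routing idea, and one way to limit its failure points, can be sketched as follows. The team's router was LLM-based; the keyword rules below are a deliberately simple stand-in, and the tool names, the patterns, and the fallback policy are all assumptions rather than details from the report.

```python
import re
from typing import Callable

# Hypothetical names for the four tools mentioned in the article.
TOOLS = ("metadata_search", "exact_match", "document_comparison", "hybrid_search")


def route(question: str) -> str:
    """Pick a tool name with keyword rules (a stand-in for an LLM router)."""
    lowered = question.lower()
    if re.search(r'"[^"]+"', question):
        return "exact_match"  # a quoted phrase suggests a literal lookup
    if re.search(r"\b(compare|difference between|both)\b", lowered):
        return "document_comparison"
    if re.search(r"\b(case number|judge|date of|claimant|defendant)\b", lowered):
        return "metadata_search"
    return "hybrid_search"


def answer_with_fallback(
    question: str, tools: dict[str, Callable[[str], list[str]]]
) -> tuple[str, list[str]]:
    """Run the routed tool; fall back to hybrid search if it fails or is empty."""
    chosen = route(question)
    try:
        hits = tools[chosen](question)
    except Exception:
        hits = []  # a failing specialised tool should not sink the whole answer
    if not hits and chosen != "hybrid_search":
        chosen = "hybrid_search"
        hits = tools[chosen](question)
    return chosen, hits
```

The fallback means a routing mistake degrades into the simple pipeline's behavior instead of an empty answer, at the cost of one extra search call.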

Where everything broke down

The main problem turned out to be chunking. The same splitting template behaved differently from page to page, and tiny meaningless fragments simply had to be glued to adjacent chunks. In the simple scheme, regular expressions also got in the way: they sped up pattern search but easily missed relevant cases or produced false positives. A separate issue surfaced around grounding: at first the required links and metadata weren't loaded properly; once that was fixed, grounding improved but accuracy dropped. It is a good illustration that retrieval systems can rarely be optimized for a single metric without side effects.
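Gluing tiny fragments to their neighbours can be sketched in a few lines of Python. The `min_chars` threshold and the function name `merge_small_chunks` are assumptions; the article gives no concrete cut-off.

```python
def merge_small_chunks(chunks: list[str], min_chars: int = 80) -> list[str]:
    """Attach fragments shorter than min_chars to an adjacent chunk."""
    merged: list[str] = []
    for chunk in chunks:
        text = chunk.strip()
        if not text:
            continue  # drop empty fragments entirely
        if merged and len(text) < min_chars:
            # A tiny fragment is appended to the chunk before it.
            merged[-1] = merged[-1] + "\n" + text
        elif merged and len(merged[-1]) < min_chars:
            # A tiny leading fragment absorbs the chunk that follows it.
            merged[-1] = merged[-1] + "\n" + text
        else:
            merged.append(text)
    return merged
```

For example, a stray page number or a lone heading ends up inside the neighbouring paragraph instead of competing with it as a separate search result.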

"In such tight deadlines, building such a system is practically impossible without code agents."

The final results only confirmed this. The simple solution reached 0.79 accuracy at 0.63 grounding and behaved stably, if not ideally. The more complex agent version lost on accuracy in the preliminary stage and ran slower, and in the finals it wasn't submitted at all because of API errors before the deadline. The authors also warn of another trap: code agents are useful for glue code and routine tasks, but in complex settings they can replace real steps with stubs, "magic numbers," or narrow regex hacks that look like solutions but don't survive real testing.
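The article reports accuracy and grounding scores without defining them. The sketch below shows one plausible way to track both side by side. It assumes accuracy is the share of exactly matching answers and grounding is an F1 score over cited (document, page) pairs; the challenge's real formulas may differ.

```python
def accuracy(predicted: list, expected: list) -> float:
    """Share of answers that match the reference exactly (assumed definition)."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equal in length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def grounding_f1(cited: set, gold: set) -> float:
    """F1 between cited and reference (doc_id, page) pairs (assumed definition)."""
    if not cited and not gold:
        return 1.0  # nothing to cite and nothing cited
    overlap = len(cited & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cited)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Logging both numbers for every run makes a trade-off like the one the team hit, better grounding with lower accuracy, visible as soon as it appears.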

What this means

The breakdown is a good illustration of the real state of agentic RAG in 2026. In tasks involving legal documents, the winner is not the flashiest scheme but the one in which chunking, grounding, metadata, and testing are under control. For teams building AI search over internal knowledge bases, the conclusion is simple: first build reliable, measurable retrieval, and only then add routers, agents, and complex orchestration.
