Claude Code Raised Legal RAG to 0.791, but ARLC 2026 Final Hit Scaling Limits

AI-processed from Habr AI; edited by Hamidun News

The ARLC 2026 case shows how fragile RAG can be in real-world tasks. Over five days, the author, working with Claude Code, raised the score of a legal RAG pipeline from 0.034 to 0.791 on warmup, and then hit a hard scaling wall in the finale.

From bug to breakthrough

The challenge required not just answering questions about court decisions and laws, but accurately specifying source pages. Because of this, grounding became a multiplier for the entire final score: even with strong answers, weak citations nearly zeroed out the score. This is what happened at the start: the first version showed 0.034, although accuracy on the answer side was already high. The problem turned out not to be in the model or retrieval, but in the output format. The author spent three attempts before noticing a simple error: the doc_id field was sending the filename with .pdf, while the system expected an identifier without the extension. A single fix raised grounding from 0.05 to 0.55, and the overall result from 0.034 to 0.438.
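A format check of this kind can be run before every submission. The sketch below is only an illustration: the `doc_id` field comes from the article, while the function names, the sample identifiers, and the idea of comparing against a set of known IDs are assumptions rather than the author's actual code.

```python
def normalize_doc_id(raw: str) -> str:
    """Drop a trailing .pdf (any case) so the ID matches the extension-less form."""
    name = raw.strip()
    if name.lower().endswith(".pdf"):
        name = name[: -len(".pdf")]
    return name


def find_format_errors(citations: list[dict], known_doc_ids: set[str]) -> list[str]:
    """Return readable problems; an empty list means the citations look well-formed."""
    errors = []
    for i, citation in enumerate(citations):
        doc_id = citation.get("doc_id")
        if not isinstance(doc_id, str) or not doc_id:
            errors.append(f"citation {i}: missing doc_id")
        elif doc_id not in known_doc_ids:
            errors.append(f"citation {i}: unknown doc_id {doc_id!r}")
    return errors


# Hypothetical identifiers, used only to exercise the check.
known = {"law_12_2004", "case_0153"}
raw = [{"doc_id": "law_12_2004.pdf"}, {"doc_id": "case_0153"}]

bad = find_format_errors(raw, known)      # flags the ID that still carries .pdf
fixed = [{**c, "doc_id": normalize_doc_id(c["doc_id"])} for c in raw]
good = find_format_errors(fixed, known)   # empty after normalization
print(bad)
print(good)
```

Failing fast on an unknown identifier surfaces this class of bug locally, without spending a submission on it.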

The pipeline then reached 0.791 on warmup in 17 iterations. The F-beta math with β=2.5 also helped separately: it showed that extra pages hurt more than it seems, and each extra link can cost 10–22% of grounding quality.
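The effect of extra pages can be checked with the standard F-beta formula, F = (1 + β²)·P·R / (β²·P + R). The sketch below assumes grounding is scored per question over sets of (document, page) pairs; the exact scoring rules of the challenge, including how empty predictions are handled, are not given in the article, so treat those details as assumptions.

```python
def f_beta(predicted_pages, gold_pages, beta=2.5):
    """Standard F-beta over two sets of (doc_id, page) pairs."""
    predicted, gold = set(predicted_pages), set(gold_pages)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # assumption: no overlap scores zero
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical question with a single gold page.
gold = {("doc_a", 3)}
exact = f_beta([("doc_a", 3)], gold)
one_extra = f_beta([("doc_a", 3), ("doc_a", 4)], gold)
two_extra = f_beta([("doc_a", 3), ("doc_a", 4), ("doc_a", 5)], gold)
print(round(exact, 3), round(one_extra, 3), round(two_extra, 3))
# prints: 1.0 0.879 0.784
```

In this single-gold-page case, the first extra page costs about 12% of the score and the second about 11% more, which falls inside the range the article quotes.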

Architecture and techniques

The best result came from a pipeline that indexed entire PDF pages rather than chunks. This is an important choice for legal RAG: if the metric checks whether the cited page is correct, chunking complicates mapping a retrieved chunk back to its page and generates noise. Search used a hybrid scheme (BM25 plus embeddings, merged with RRF fusion), and OCR was added for scans. On top of this, the author limited the number of pages in the output and routed comparison questions, where two documents need to be compared, separately.
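The fusion step of such a hybrid scheme can be sketched as follows. This is a generic Reciprocal Rank Fusion, not the author's code: the constant k=60 is the commonly used default, and the page identifiers are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of page IDs; each hit adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by page ID for determinism.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


# Hypothetical top-3 results from the two retrievers ("doc:page" strings).
bm25_hits = ["a:1", "b:7", "c:2"]
embedding_hits = ["b:7", "d:4", "a:1"]

fused = reciprocal_rank_fusion([bm25_hits, embedding_hits])
print(fused)
# prints: ['b:7', 'a:1', 'd:4', 'c:2']
```

Because RRF uses only ranks, BM25 scores and embedding similarities never need to be put on a common scale.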

  • Page-level retrieval instead of chunks
  • BM25 + embeddings + Reciprocal Rank Fusion
  • OCR fallback for empty or scanned pages
  • Limiting the number of pages in responses by question type
  • Fast deterministic branches for simple cases

"First validate output format. Then improve quality."
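The per-question-type page limit from the list above could look roughly like this. The article does not give the actual limits or type names, so the values in `MAX_PAGES_BY_TYPE`, the type labels, and the comparison rule (best page from each distinct document) are all illustrative assumptions.

```python
# Hypothetical limits; the article does not state the real ones.
MAX_PAGES_BY_TYPE = {"boolean": 1, "number": 1, "free_text": 2, "comparison": 2}
DEFAULT_MAX_PAGES = 2


def select_pages(ranked_pages, question_type):
    """Trim a ranked list of (doc_id, page) pairs to the limit for this type."""
    limit = MAX_PAGES_BY_TYPE.get(question_type, DEFAULT_MAX_PAGES)
    if question_type == "comparison":
        # Assumed routing rule: take the best-ranked page of each distinct
        # document, so both compared documents are cited.
        seen_docs, picked = set(), []
        for doc_id, page in ranked_pages:
            if doc_id not in seen_docs:
                seen_docs.add(doc_id)
                picked.append((doc_id, page))
            if len(picked) == limit:
                break
        return picked
    return list(ranked_pages[:limit])


ranked = [("law_a", 4), ("law_a", 5), ("law_b", 12), ("law_c", 1)]
print(select_pages(ranked, "boolean"))     # [('law_a', 4)]
print(select_pages(ranked, "comparison"))  # [('law_a', 4), ('law_b', 12)]
```

Keeping the limits in one table makes each change to them a single, separately testable edit.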

A separate thread in this case is the role of Claude Code. With its help, the author assembled around 3000 lines of code across seven modules in five days and got through 17 versions instead of the typical 3–5 done by hand. The agent accelerated fixes, refactoring, submission runs, and diff checks before sending. But strategic decisions still remained with the human: which metrics to fix first, how to interpret regressions, and when not to touch an already-tuned prompt.

Where it broke

On warmup, the corpus consisted of 30 documents and 100 questions; in the finale it was 303 documents, 4244 pages, and 900 questions. That is where it became clear that a pipeline that performs well on a small set does not necessarily scale to a larger one. First, a cache bug surfaced: the system indexed the 30 warmup documents instead of the 303 final ones, which caused null answers to spike to 37. After clearing the cache, that problem went away.

The main collapse remained, however: the final score dropped by 42%, to 0.457. The root causes turned out to be architectural. A huge document, DIFC Courts Rules, started polluting the output for many legal queries; consultation papers with the same numbers but different years broke disambiguation; and a regex for law numbers was replacing answers about penalties with law numbers. Applying a batch of eight fixes at once seemed reasonable, but in aggregate it worsened the metric balance: deterministic accuracy improved somewhat, but grounding and the overall score declined even more.

This breakdown is valuable because it doesn't sell AI-assistant magic. Claude Code gave speed, but didn't remove the main engineering work: validate the format, calculate the metrics, test one change at a time, and check the system at a scale close to production. The author's main conclusion is harsh: if the eval set is many times smaller than the production corpus, you're testing not retrieval, but luck.
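One way to guard against the stale-index bug described above is to derive the cache name from the corpus itself, so an index built for one document set is never reused for another. This is a sketch under assumptions: the article does not describe how the cache was keyed, and the file names, sizes, and `.pkl` naming here are hypothetical.

```python
import hashlib


def corpus_fingerprint(files):
    """Hash sorted (filename, size_in_bytes) pairs into a short, stable key."""
    digest = hashlib.sha256()
    for name, size in sorted(files):
        digest.update(f"{name}\x00{size}\n".encode("utf-8"))
    return digest.hexdigest()[:16]


def index_cache_name(files):
    """Cache file name that changes whenever the corpus changes."""
    return f"page_index_{corpus_fingerprint(files)}.pkl"


# Hypothetical listings standing in for the warmup and final corpora.
warmup = [(f"doc_{i:03d}.pdf", 1000 + i) for i in range(30)]
final = [(f"doc_{i:03d}.pdf", 1000 + i) for i in range(303)]

print(index_cache_name(warmup) != index_cache_name(final))  # True
```

Hashing names and sizes is cheap but will miss an edit that keeps a file's size unchanged; hashing file contents is the stricter option.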

What this means

For teams building RAG products, this is a good cold shower. Victory goes not to the most complex stack, but to discipline: precise output format, clear metrics, minimal noise in citations, and validation at real scale. AI coding assistants already provide a serious speed-up, but for now they do not replace engineering thinking and responsibility for architectural decisions.
