deepvk USER2-base model nearly matched OpenAI in a case-law embedding benchmark
AI-processed from Habr AI; edited by Hamidun News
The local Russian-language model deepvk USER2-base nearly matched OpenAI and Voyage in an embeddings test for case law search. On a corpus of 858 intellectual property decisions, the benchmark author concluded that a narrow legal RAG does not always need an expensive API, and that the value of a reranker depends heavily on the strength of the base model.
How the test was set up
For the evaluation, the author assembled a narrow but practical corpus: 858 decisions from the Court for Intellectual Property Rights and the text of Part IV of the Civil Code of the Russian Federation. The models were tested on 30 questions of varying difficulty, from standard disputes over counterfeit goods on marketplaces to cases involving patents, trademarks, domains, and copyright on social networks. Importantly, scoring was based not on the reranker’s final output but on each model’s raw top-20: the top-20 lists from the seven embedding models were merged, deduplicated, and then annotated.
This let the author avoid the bias where documents that were never annotated automatically receive a zero score. The labeling was done with NotebookLM and then spot-checked manually. In total, this produced 2,751 “question–case” pairs, each scored from 0 to 2.
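The pooling step described above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the function name `pool_candidates`, the model keys, and the case IDs are all hypothetical.

```python
from typing import Dict, List


def pool_candidates(runs: Dict[str, List[str]], depth: int = 20) -> List[str]:
    """Merge each model's top-`depth` document IDs into one deduplicated pool.

    `runs` maps a model name to its ranked list of document IDs for one
    question. The pool keeps first-seen order so the result is deterministic.
    """
    seen = set()
    pool = []
    for ranked_ids in runs.values():
        for doc_id in ranked_ids[:depth]:
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    return pool


# Hypothetical toy data: two models, depth of 2 instead of 20.
runs = {
    "model_a": ["case_1", "case_2", "case_3"],
    "model_b": ["case_3", "case_4", "case_1"],
}
pool = pool_candidates(runs, depth=2)
```

Every document in the pool is then labeled once, so no model is penalized merely because its candidates were never looked at.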
The main metric was nDCG@5, because for a real user the quality of the whole top-5 matters more than just the first relevant hit. MRR was also calculated, and a paired bootstrap with 2,000 iterations was run. The author frankly describes the test as a pilot: 30 questions is too few, so some of the differences between models remain within statistical noise.
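For readers who want to reproduce this kind of scoring, here is a minimal sketch of the three pieces mentioned above. The article does not say which nDCG formulation was used, so the linear gain with a log2 discount, the `min_grade` threshold for MRR, and the percentile confidence interval are assumptions; all function names are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple


def dcg_at_k(grades: List[int], k: int = 5) -> float:
    """Discounted cumulative gain over the first k grades (linear gain: an assumption)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades[:k]))


def ndcg_at_k(ranked_ids: List[str], labels: Dict[str, int], k: int = 5) -> float:
    """nDCG@k for one question; unlabeled documents count as grade 0."""
    gains = [labels.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(labels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def reciprocal_rank(ranked_ids: List[str], labels: Dict[str, int], min_grade: int = 1) -> float:
    """1 / rank of the first document graded at least `min_grade` (threshold is an assumption)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if labels.get(doc_id, 0) >= min_grade:
            return 1.0 / rank
    return 0.0


def paired_bootstrap_ci(scores_a: List[float], scores_b: List[float],
                        iterations: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """95% percentile interval for the mean per-question difference A - B."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(iterations):
        # Resample questions with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    return means[int(0.025 * iterations)], means[int(0.975 * iterations) - 1]


# Hypothetical toy question with grades on the article's 0-2 scale.
labels = {"a": 2, "b": 1, "c": 0}
score = ndcg_at_k(["c", "a", "b"], labels)
rr = reciprocal_rank(["c", "a", "b"], labels)
```

If the interval returned by `paired_bootstrap_ci` contains zero, the two models are not distinguishable on that question set, which is the situation the author reports for the top group.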
Who came out ahead
The top group included OpenAI text-embedding-3-large, Voyage voyage-3, and the local deepvk USER2-base. On this sample they were statistically indistinguishable from one another, although they clearly outperformed Yandex and some models from the middle group. The main takeaway is not that an absolute winner emerged, but that a free local Russian-language model ended up in the same league as commercial APIs.
“USER2-base is the test’s main find.”
- Top three: OpenAI, Voyage, and USER2-base
- USER2-base without a reranker posted nDCG@5 of 0.773
- The USER2-base + jina-reranker-v3 combo rose to 0.797
- OpenAI without a reranker scored 0.809, meaning the gap remained within the margin of error
- A hybrid of OpenAI and USER2-base expanded coverage of “ideal” cases from 33% to 49%
The last point is especially interesting for RAG pipelines. Different embeddings surface different documents, so a hybrid candidate pool noticeably broadens coverage. But the author separately notes that this is still an oracle analysis, not a fair test of production output: if ranking is weak, the right documents will still sit in positions 10–15. To confirm the effect in production, a separate test with Reciprocal Rank Fusion and final nDCG is needed.
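The follow-up test mentioned above would need a fusion step. A minimal sketch of Reciprocal Rank Fusion is shown below; the constant `k = 60` is the value commonly used with this method, not something the article specifies, and the case IDs and list names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 lists from two embedding models for one question.
openai_top = ["case_a", "case_b", "case_c"]
user2_top = ["case_b", "case_d", "case_a"]
fused = reciprocal_rank_fusion([openai_top, user2_top])
```

Documents that both models rank highly move to the top of the fused list, while documents found by only one model still stay in the candidate pool. Whether that improves the final nDCG is exactly what the separate test would have to show.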
Where the reranker helps
With rerankers, the picture was less obvious. Of the four rerankers tested, the author calls jina-reranker-v3 and bge-reranker-v2-m3 the only ones that really worked for a Russian legal corpus, with jina looking slightly better on average. On this dataset, mxbai-rerank-base-v2 noticeably hurt results, while the English mmarco was almost neutral.
The practical conclusion is simple: you cannot choose a reranker “by default” just because it is popular in the English-language stack. The reranker’s effect depended heavily on the quality of the original embedding. On strong models like OpenAI, Voyage, and USER2-base, the gains stayed within the margin of error.
On weaker ones, the benefit was already clear: Yandex rose from 0.630 to 0.755 with bge, while Cohere went from 0.700 to 0.793 with jina. In terms of indexing time, almost all models finished the full corpus in 7–15 minutes, whereas Yandex took about 2.5 hours because of API limits. As a result, the author plans to put USER2-base and jina-reranker-v3 into the bot, while keeping bge as a fallback if there is not enough hardware.
What it means
For Russian-language vertical RAG systems, this is a strong signal: local models can already compete with major APIs in narrow domains if they are tested on a real corpus rather than on averaged benchmarks. Another takeaway is that a reranker is not a magic button: its value appears where the base embedding does not rank well enough on its own.