Pollux by Sber AI: an LLM judge for evaluating Russian-language models

Sber AI has unveiled Pollux, a judge model for automatically evaluating Russian-language LLMs. The tool addresses a problem developers have faced for years: how to check an LLM's quality reliably and quickly before deploying it to production.
From Manual Checks to Automation
A few years ago, when language models first began generating reasonable answers, quality assessment came down to time and money. People manually checked every model response, noted errors, judged compliance with instructions, and verified factual accuracy. The process was slow: checking hundreds of responses took days or weeks.
Today, LLMs handle serious tasks: they write working code, conduct customer conversations, and plan delivery routes. But a model still needs to be evaluated before it goes into a real product, and manual checking has become a bottleneck in development. Companies lose time while experts verify responses by hand.
Pollux: A Solution for the Russian Language
Pollux addresses this problem. It is a specialized language model trained on Russian-language data for the task of evaluating other LLMs. It can run inside your development pipeline and check response quality automatically. The model is released as open source, so developers pay no license fees and sign no contracts. You download it, embed it in your code, and use it.
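To make the pipeline idea concrete, here is a minimal sketch of a release gate built around a judge model. Everything in it is an assumption for illustration: the `judge_score` callable, the 0-to-1 score scale, and the 0.8 threshold are hypothetical and do not describe Pollux's actual interface or output format.

```python
from statistics import mean
from typing import Callable, Iterable, Tuple


def release_gate(
    samples: Iterable[Tuple[str, str]],
    judge_score: Callable[[str, str], float],
    threshold: float = 0.8,
) -> bool:
    """Return True when the mean judge score over all samples meets the threshold."""
    scores = [judge_score(prompt, answer) for prompt, answer in samples]
    if not scores:
        raise ValueError("no samples to evaluate")
    return mean(scores) >= threshold


def toy_judge(prompt: str, answer: str) -> float:
    """Stand-in for a real judge model call (hypothetical): scores non-empty answers 1.0."""
    return 1.0 if answer.strip() else 0.0


samples = [
    ("What is 2 + 2?", "4"),
    ("Name the capital of Russia.", ""),  # empty answer, scored 0.0 by the stand-in
]
passed = release_gate(samples, toy_judge, threshold=0.8)
print("release allowed:", passed)
```

In a real pipeline the stand-in would be replaced by a call to the judge model, and the gate would run on a fixed evaluation set before each release.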
How the Judge Model Works
Pollux checks language model responses against several criteria: information accuracy, answer completeness, compliance with the required style, adherence to the original instructions, and relevance to context. It is far faster than a human: an evaluation takes seconds instead of hours of manual work. It scales, so thousands of responses can be checked in one run. It also costs less: where you once paid an expert for each checked response, the model now does the work without a per-response expert fee.
One reason Sber released the tool openly is to give the entire ecosystem a standard evaluation method. The model is trained on Russian, which matters because evaluation criteria are often language-specific: Russian has flexible grammar with complex rules, and appropriate style depends on context. A judge trained on Russian is therefore expected to check Russian text more accurately than one trained on English.
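Criterion-by-criterion judging of this kind could be wired up roughly as follows. This is a sketch under stated assumptions, not Pollux's real prompt format or API: the criterion labels are paraphrased from the list above, and the prompt wording, the 0-to-2 scale, and the `judge` callable are hypothetical.

```python
from typing import Callable, Dict

# Labels paraphrased from the criteria named in the article; the identifiers
# Pollux actually uses may differ.
CRITERIA = [
    "information accuracy",
    "answer completeness",
    "style compliance",
    "instruction adherence",
    "context relevance",
]


def build_prompt(instruction: str, answer: str, criterion: str) -> str:
    """Assemble a judge prompt for a single criterion (wording is an assumption)."""
    return (
        f"Instruction:\n{instruction}\n\n"
        f"Answer:\n{answer}\n\n"
        f"Rate the answer for {criterion} on a scale from 0 to 2. "
        "Reply with the number only."
    )


def evaluate(instruction: str, answer: str, judge: Callable[[str], str]) -> Dict[str, int]:
    """Ask the judge once per criterion and collect the integer scores."""
    scores = {}
    for criterion in CRITERIA:
        raw = judge(build_prompt(instruction, answer, criterion))
        scores[criterion] = int(raw.strip())
    return scores


def stub_judge(prompt: str) -> str:
    """Stand-in for a real judge model call; always replies '2'."""
    return "2"


scores = evaluate("Translate 'cat' into Russian.", "кошка", stub_judge)
```

Scoring each criterion separately keeps the results interpretable: a low total can be traced to, say, style rather than factual accuracy.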
Industry Standardization
Until now, each company had its own criteria for evaluating LLMs, often improvised and incomplete. One developer checks five criteria, another fifteen, so the results cannot be compared. Pollux aims to create a unified standard: a common tool that everyone can apply to their own models. This should simplify comparing LLMs with one another and reduce risk before launching to production.
For the Russian-speaking AI community this is especially meaningful, since most evaluation tools are oriented toward English and English-language context. With Pollux, Russian-speaking developers get a tool adapted to their realities.
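The comparability argument can be illustrated with a short sketch: once two models are scored on the same criteria, their per-criterion averages can be compared directly. The criterion names and scores below are invented for illustration and are not Pollux output.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_by_criterion(results: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the scores for each criterion across all judged responses."""
    buckets = defaultdict(list)
    for per_response in results:
        for criterion, score in per_response.items():
            buckets[criterion].append(score)
    return {criterion: mean(values) for criterion, values in buckets.items()}


def compare(model_a: List[Dict[str, float]], model_b: List[Dict[str, float]]) -> Dict[str, float]:
    """Return mean(model_a) - mean(model_b) per criterion; refuse mismatched criteria."""
    a = mean_by_criterion(model_a)
    b = mean_by_criterion(model_b)
    if set(a) != set(b):
        raise ValueError("models were scored on different criteria")
    return {criterion: a[criterion] - b[criterion] for criterion in a}


# Hypothetical scores: one dict per judged response.
model_a = [{"accuracy": 2, "completeness": 1}, {"accuracy": 1, "completeness": 1}]
model_b = [{"accuracy": 1, "completeness": 2}]
diff = compare(model_a, model_b)
```

The explicit check on the criteria sets mirrors the point in the text: scores produced under different criteria are rejected rather than silently compared.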
What This Means
Automatic LLM evaluation is becoming a development standard rather than an expensive luxury. Developers will be able to iterate faster and experiment with architectures and data without waiting in line for experts. The development cycle could become several times faster.
For users, this means higher-quality, more reliable AI services, because models are tested more thoroughly before release. The Russian-speaking developer community finally gets a tool adapted to the peculiarities of its native language.