EMNLP 2025: Why neural networks now check themselves (and why it's hard)

Q: What is the source?

Originally published on Habr AI. Hamidun News processes and adapts the material with AI.

Q: When was it published?

2026-02-03. Reading time: 3 min.

At EMNLP 2025 it became unmistakably clear: the era when only humans judged translation quality is coming to an end. The industry is now betting on aut…

Hamidun News Editorial



Imagine you wrote a complex essay, but instead of a strict teacher with a red pen, a slightly more well-read classmate checks it. That is roughly what the machine translation industry looks like now. The EMNLP 2025 conference in Suzhou once again confirmed the main trend of recent years: we have largely delegated the evaluation of neural networks to other neural networks. Evaluating translation quality used to require crowds of linguists and experts; now much of that work falls on the shoulders of large language models. It's not just a matter of saving money. There is simply too much data for humans to review.

The Yandex team came to the conference not just as listeners. They brought two papers that show how the approach to text evaluation is changing. Katya Enikeeva, who leads translation analytics, emphasizes an important nuance: teaching a model to translate is only half the battle. It's much harder to teach it to understand where exactly it made a mistake. This demands a different level of reflection from an LLM. The model must play the role of a critic who sees not just grammar, but also distortions of meaning, loss of style, and inappropriate tone. Such solutions now determine how seamlessly video translation works in your browser, or how well search works across foreign-language websites.

Why is this important right now? We've hit the ceiling of classical metrics. Tried-and-true algorithms like BLEU, which simply count the word sequences a translation shares with a human reference translation, are no longer enough. They don't understand irony, don't see context, and can give high scores to text that completely distorts the meaning. They're being replaced by complex pipelines where one model analyzes the source, a second analyzes the result, and a third delivers the verdict. This creates a kind of intellectual ecosystem where quality grows through constant internal auditing. At EMNLP 2025, it became clear that those who fail to build such evaluation systems will fall hopelessly behind in the race for generation quality.
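To see why word-overlap scoring breaks down, here is a minimal sketch in Python. It is not full BLEU: it computes only the clipped n-gram precision that BLEU is built on, with no brevity penalty and no averaging across n-gram orders. The function names and the example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference (the "clipping" used by BLEU). Whitespace tokenization is
    an assumption made to keep the sketch short.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical example: the first candidate drops "not" and flips the
# meaning, the second one keeps the meaning but uses different words.
reference = "the film was not bad at all"
distorted = "the film was bad at all"
paraphrase = "the movie was quite good"

for name, text in [("distorted", distorted), ("paraphrase", paraphrase)]:
    p1 = clipped_precision(text, reference, 1)
    p2 = clipped_precision(text, reference, 2)
    print(f"{name}: unigram={p1:.2f} bigram={p2:.2f}")
```

In this toy case the candidate that reverses the meaning shares every one of its words with the reference, while the faithful paraphrase shares only two. An overlap metric ranks them the wrong way round, which is exactly the gap LLM-based critics are meant to close.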

Yandex presented its work in two key venues: the Findings track of the main conference and the WMT workshop. This is recognition by the global community that Russian engineers set the bar in one of the most complex areas: automatic quality assessment. It's important to understand that behind the academic titles of the papers lie quite practical things. When you open a page in Chinese and a second later read coherent text in Russian, behind this stands not only a powerful translation model, but also a control system that filters out hallucinations and errors in real time. Without this control, we'd still be reading "superbrain" translations from the early 2000s.

The industry is moving toward complete automation of the learning cycle. In an ideal world (which has almost arrived), one model generates data, another model evaluates it, and the first model is retrained on the basis of that evaluation. Humans remain only in the role of supreme judge who sets the general rules of the game. This conceals the main trap, however: if the evaluating model starts making mistakes or rewarding "beautiful lies," the entire system will collapse. Hallucination in evaluation is the next major challenge, and it was widely discussed in the hallways of the conference. We're teaching neural networks to be honest critics, but they still try to be merely agreeable conversation partners.
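The trap in this loop is easier to see in code. Below is a toy sketch in Python, not anyone's production pipeline: every name, threshold, and example in it is hypothetical. It shows one possible safeguard, where the judge's verdicts are first compared against a small set of human-labelled examples, and its scores are used to pick retraining data only if the two agree often enough.

```python
from typing import Callable, List, Tuple

# A judge maps (source, translation) to a score in [0, 1]. In a real
# system this would be an LLM call; here it is a plain function.
Judge = Callable[[str, str], float]


def judge_agreement(judge: Judge,
                    audit_set: List[Tuple[str, str, bool]],
                    threshold: float = 0.5) -> float:
    """Share of human-labelled examples where the judge's accept/reject
    decision matches the human label."""
    if not audit_set:
        raise ValueError("audit set must not be empty")
    hits = sum((judge(src, hyp) >= threshold) == human_ok
               for src, hyp, human_ok in audit_set)
    return hits / len(audit_set)


def select_for_retraining(judge: Judge,
                          candidates: List[Tuple[str, str]],
                          audit_set: List[Tuple[str, str, bool]],
                          threshold: float = 0.5,
                          min_agreement: float = 0.8) -> List[Tuple[str, str]]:
    """Keep candidates the judge accepts, but only if the judge itself
    passes the human audit first."""
    if judge_agreement(judge, audit_set, threshold) < min_agreement:
        raise RuntimeError("judge disagrees with human audit labels")
    return [(src, hyp) for src, hyp in candidates
            if judge(src, hyp) >= threshold]


def fluency_only_judge(src: str, hyp: str) -> float:
    """Toy 'convenient' judge: rewards longer output and ignores the source."""
    return min(1.0, len(hyp.split()) / 5)


def number_checking_judge(src: str, hyp: str) -> float:
    """Toy 'careful' judge: accepts only if the numbers are preserved."""
    def nums(text: str) -> set:
        return {tok for tok in text.split() if tok.isdigit()}
    return 1.0 if nums(src) == nums(hyp) else 0.0


# Hypothetical human-labelled audit examples: (source, translation, is_ok).
AUDIT = [
    ("price: 5 rubles", "The price is 5 rubles", True),
    ("price: 5 rubles", "The price is a wonderful 50 rubles today", False),
    ("meeting at 9", "Meet at 9", True),
]

# Hypothetical generator output awaiting evaluation.
CANDIDATES = [
    ("room 12", "Room 12"),
    ("room 12", "A lovely spacious room 21 awaits you"),
]
```

With these toy inputs, the number-checking judge passes the audit and keeps only the faithful candidate, while the fluency-only judge accepts the polished but wrong translation in the audit set and is rejected before its scores can reach retraining. The design choice is to spend the scarce human labels on auditing the judge rather than on labelling every example.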

The bottom line: manual data labeling has become an elite and very expensive hobby. The future lies with automatic metrics based on LLMs, and Yandex's work at EMNLP 2025 shows that the company is at the forefront of this process. Will neural network critics surpass humans at understanding context this year?

