Qwen 3.6 Plus outperforms DeepSeek V4 Pro in Russian benchmark, proves more cost-effective

Q: What is the source?

Originally published on Habr AI. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Apr 29, 2026. Reading time: 3 min.

Hamidun News Editorial

A fresh hands-on test of six April LLMs on Russian-language content produced an unexpected result: the new flagship, DeepSeek V4 Pro, did not take first place. Qwen 3.6 Plus performed better, despite being released earlier and costing less.

Who came out ahead

DeepSeek V4 Pro was expected to deliver Tier S results, meaning over 95 points out of 100. The expectation was reasonable: the model is large and new, posts strong results on AIME and SWE-bench, and emphasizes a reasoning architecture. But in a practical test on Russian-language content, it scored 89 points.

This is a strong result, but not what is expected of a release positioned as a market flagship. The comparison within the DeepSeek lineup itself is even more interesting: the Flash version scored 83 points, only 6 points behind Pro.

Meanwhile Qwen 3.6 Plus, released 22 days earlier, scored 92 points in a re-test. The older model thus outperformed the latest DeepSeek release not only in text quality but also in overall usefulness for real-world Russian-language tasks.

Price versus quality

The main surprise lies not just in the scores but in the economics. If Pro outperforms Flash by only 6 points while costing 13 times more, the choice for production no longer looks obvious. For teams generating large volumes of content, this difference quickly becomes a noticeable line item in the budget.

In such a scenario, what matters is not the highest absolute score but how much useful output the model delivers per dollar spent. The author of the comparison emphasizes exactly this in the updated methodology and proposes evaluating models by score-per-dollar. This approach changes the conclusions more than a conventional ranking by raw scores does.

A model may fall slightly short on quality but win in real-world use due to price, speed, and more predictable behavior on long responses. For editorial and product teams, this is far more useful than blindly paying for the most expensive option.

DeepSeek V4 Pro: 89 points, against Tier S expectations
DeepSeek Flash: 83 points at a much lower price
Qwen 3.6 Plus: 92 points, the highest score in the comparison
Difference between Pro and Flash: 6 points, with a 13-fold price difference
Key metric for selection: not just score, but score-per-dollar
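
The score-per-dollar idea can be illustrated with a minimal Python sketch. The article gives no absolute prices, only the 13-fold ratio between Pro and Flash, so the sketch uses relative price units (Flash = 1, Pro = 13) as an assumption. Qwen 3.6 Plus is left out because its price is not stated. The names ModelResult and score_per_price_unit are hypothetical and are not taken from the benchmark author's methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    score: float           # benchmark score out of 100, as reported in the article
    relative_price: float  # ASSUMPTION: price in multiples of the Flash price


def score_per_price_unit(result: ModelResult) -> float:
    """Return benchmark points per unit of relative price."""
    if result.relative_price <= 0:
        raise ValueError("relative_price must be positive")
    return result.score / result.relative_price


results = [
    # Only the 13x ratio comes from the article; absolute prices are unknown.
    ModelResult("DeepSeek V4 Pro", score=89, relative_price=13.0),
    ModelResult("DeepSeek Flash", score=83, relative_price=1.0),
]

# Rank by value for money instead of by raw score.
ranked = sorted(results, key=score_per_price_unit, reverse=True)

for r in ranked:
    print(f"{r.name}: {score_per_price_unit(r):.2f} points per price unit")
```

Under these assumptions Flash delivers 83.00 points per price unit and Pro about 6.85, so the ranking by raw score is reversed. With real per-token prices in place of the relative units, the same function yields an actual score-per-dollar figure.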

Why reasoning didn't save it

One of the main hypotheses after the test is that optimization for reasoning does not guarantee strong narrative results. Metrics like AIME and SWE-bench are good at demonstrating a model's capabilities in mathematics, code, and structured reasoning, but they are worse at predicting how well it will write lively, coherent, and convincing text in Russian. For content tasks, rhythm, precision of phrasing, sense of structure, and handling of language nuances matter, not just the ability to break a task down into steps correctly.

Against this backdrop, the methodology updates look less like a formality and more like an honest attempt to bring the test closer to production conditions. The changes include max_tokens adjustments, paid re-testing, and stricter evaluation of the practical value of answers. In other words, the test no longer simply compares "intelligent" models, but models that must consistently solve a specific editorial task within a given budget.

Under these conditions it became clear that being new is no longer an advantage in itself.

What this means

The LLM market looks less and less like a "newer is better" race. For Russian-language content tasks, the winner is not the loudest model but the one that better maintains text quality and pays for itself in production. For teams, this is a signal to re-test new flagships on their own scenarios more often rather than choosing them solely on the basis of benchmark headlines.
