
DeepSeek V4 Pro vs Claude Sonnet 4.6 on 50 real tasks: where to save, where the risk lies

DeepSeek V4 Pro proved to be 3–4 times cheaper than Claude Sonnet 4.6, but on a test of 50 typical tasks for a Russian developer, it fell short in…

AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

A comparison of DeepSeek V4 Pro and Claude Sonnet 4.6 on 50 typical tasks for a Russian developer showed a simple fact: a low price per token does not make a model the best choice for production. On basic scenarios the models perform nearly equally, but on tasks that depend on Russia-specific rules and terminology, DeepSeek makes noticeably more errors.

What was tested

The article's author compared the models not on academic benchmarks but on applied requests that actually occur in local teams: customer support, data extraction from documents, calculations based on Russian Labor Code and Tax Code norms, and expansion of professional abbreviations. Testing was conducted through the regular web interfaces: Claude Sonnet 4.6 without adaptive thinking, and DeepSeek V4 in fast mode without deep thinking.

In total, there were 50 prompts divided into four blocks. In April 2026, the price difference looked very aggressive in favor of DeepSeek: $1.74 per million input tokens and $3.48 per million output tokens, against $3 and $15 for Sonnet 4.6. On a real workload this gives roughly threefold savings, so the temptation to switch to the cheaper model is quite understandable.

  • Classification of 20 support tickets into five categories
  • Extraction of fields from 15 documents with OCR errors
  • 10 tasks on reasoning with Russian law norms and calculations
  • 5 tasks on local terminology like EDS, UPD, OFD and KIZ
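
The per-token prices quoted above can be turned into a rough cost comparison. The sketch below is only an illustration: the dictionary keys are labels rather than API model identifiers, and the token volumes in the usage note are hypothetical, not figures from the article.

```python
from decimal import Decimal

# USD per one million tokens, as quoted in the article for April 2026.
# The keys are plain labels chosen for this sketch, not API model names.
PRICES = {
    "deepseek-v4-pro": {"input": Decimal("1.74"), "output": Decimal("3.48")},
    "claude-sonnet-4.6": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}

MILLION = Decimal(1_000_000)


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost in USD of a workload with the given token counts."""
    price = PRICES[model]
    total = price["input"] * input_tokens + price["output"] * output_tokens
    return total / MILLION


def cost_ratio(input_tokens: int, output_tokens: int) -> Decimal:
    """How many times more the Sonnet bill is than the DeepSeek bill."""
    sonnet = workload_cost("claude-sonnet-4.6", input_tokens, output_tokens)
    deepseek = workload_cost("deepseek-v4-pro", input_tokens, output_tokens)
    return sonnet / deepseek
```

For a hypothetical month with 300 million input tokens and 100 million output tokens, `workload_cost` gives $870 for DeepSeek and $2,400 for Sonnet, a ratio of about 2.8. The ratio depends on the input-to-output mix, because the output price gap is larger than the input price gap.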

Where parity exists

On simple scenarios, there was almost no difference. Both models classified the support tickets flawlessly and handled typical questions about delivery, returns, payment, and general inquiries equally well. Basic reasoning was also at parity: both systems correctly analyzed a statute-of-limitations question, the return of an advance payment, and a case of dismissal during a probationary period, albeit citing different norms.

The picture was similar in document processing. Neither model confused OGRNIP with INN, both took the amount from the numeric line when the amount written out in words contained an error, and both correctly extracted dates from advance (expense) reports. By the author's assessment, if 80% of a company's load consists of precisely such tasks, switching to DeepSeek can indeed reduce the budget by approximately 75% without a noticeable loss of quality.

"English benchmarks won't help us choose a model for a Russian task."

Where errors are costly

Problems began where general intelligence is not enough and knowledge of local context and accuracy on edge cases are required. In a test calculating take-home pay from a gross salary of 150,000 rubles, Sonnet gave the correct 130,500 rubles net, while DeepSeek gave 110,550. In effect, the model withheld 26.3% instead of the standard 13%, probably confusing personal income tax with employer insurance contributions. In a demo this is just a mistake, but in an automated pipeline it could add up to hundreds of thousands of rubles in errors per month.
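
The arithmetic behind this example is easy to check in code. The following is a minimal sketch that assumes a flat 13% personal income tax rate, as in the article's example; the function names and the tolerance are hypothetical choices for illustration.

```python
from decimal import Decimal

# Assumption: the flat 13% rate used in the article's example.
NDFL_RATE = Decimal("0.13")


def net_salary(gross: Decimal, rate: Decimal = NDFL_RATE) -> Decimal:
    """Take-home pay after withholding personal income tax."""
    return gross * (1 - rate)


def implied_withholding_rate(gross: Decimal, net: Decimal) -> Decimal:
    """Share of the gross salary that an answer implicitly withheld."""
    return (gross - net) / gross


def looks_plausible(gross: Decimal, model_net: Decimal,
                    tolerance: Decimal = Decimal("0.005")) -> bool:
    """Flag answers whose implied rate is far from the expected rate."""
    deviation = abs(implied_withholding_rate(gross, model_net) - NDFL_RATE)
    return deviation <= tolerance
```

With a gross salary of `Decimal("150000")`, `net_salary` returns 130,500, and `implied_withholding_rate` for the answer 110,550 is exactly 0.263, which `looks_plausible` rejects. A cheap deterministic check of this kind is one way to catch such an answer before it reaches payroll.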

Another failure was found in OCR normalization. Both models correctly read the amount, INN, and date in an invoice with mixed Russian and Latin characters, but only Sonnet normalized the document number to canonical form. DeepSeek left the letters O and l where digits should be. If such a number is later compared to a 1C or ERP database by exact match, the document simply won't be found, even though the other fields look correct.
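
A post-processing step can normalize such numbers regardless of which model produced them. The sketch below assumes the document number is purely numeric, which will not hold for every document type; the character map and the function name are hypothetical.

```python
# Look-alike characters that OCR can substitute for digits.
# Assumption: the target field contains digits only.
LOOKALIKES = str.maketrans({
    "O": "0", "o": "0",            # Latin letter O
    "\u041e": "0", "\u043e": "0",  # Cyrillic letter О
    "l": "1", "I": "1",            # Latin lowercase L and uppercase i
})


def normalize_doc_number(raw: str) -> str:
    """Map look-alike letters to digits and verify the result is numeric."""
    cleaned = raw.strip().translate(LOOKALIKES)
    if not cleaned.isdigit():
        # Refuse to guess: anything still non-numeric goes to manual review.
        raise ValueError(f"cannot normalize document number: {raw!r}")
    return cleaned
```

For example, `normalize_doc_number("1O45l7")` returns `"104517"`, which an exact-match lookup can then find. Raising an error on anything the map does not cover is a deliberate choice here, so that an unexpected value is reviewed instead of silently passed on.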

DeepSeek showed the most unpleasant type of error in a task about a social tax deduction for the education of a 25-year-old son. The model began its answer with "YES" and then went on to explain why, under Article 219 of the Russian Tax Code, the deduction is not allowed after age 24. For a human the contradiction is immediately obvious, but a system that parses only the first word will record the wrong class.
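
The risk of parsing only the first word can be shown with a small sketch. The cue list below is a hypothetical, deliberately crude heuristic for illustration, not a reliable contradiction detector.

```python
import re

# Hypothetical phrases that suggest the body of the answer says "no".
NEGATIVE_CUES = ("not allowed", "not eligible", "cannot be claimed")


def naive_label(answer: str) -> str:
    """What a first-word parser sees: the leading word, upper-cased."""
    match = re.match(r"\W*(\w+)", answer)
    return match.group(1).upper() if match else ""


def checked_label(answer: str) -> str:
    """Return 'REVIEW' when a leading YES conflicts with the explanation."""
    label = naive_label(answer)
    body = answer.lower()
    if label == "YES" and any(cue in body for cue in NEGATIVE_CUES):
        return "REVIEW"
    return label
```

On an answer such as "YES. Under Article 219 of the Russian Tax Code the deduction is not allowed after age 24.", `naive_label` returns `"YES"` while `checked_label` returns `"REVIEW"`. A keyword check like this only catches the phrasings it lists, so it is a safety net for a pipeline, not a substitute for a model that answers consistently.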

A similar problem surfaced in terminology: Sonnet correctly expanded KIZ as a control identification mark, while DeepSeek invented a variant, "part identification code". Overall, Sonnet and DeepSeek scored 92% versus 88% on documents, 100% versus 60% on tasks involving Russian legal specifics, and 100% versus 80% on local terminology.

What this means

The practical conclusion: DeepSeek V4 Pro is well suited for first-line support, templated responses, basic classification, and MVPs, where price is critical and an error does not trigger a financial or legal action. But if the model takes part in money calculations, interprets Russian Tax and Labor Code norms, normalizes documents, or produces answers that are parsed directly by other systems, the premium for Claude Sonnet 4.6 looks like insurance against more expensive consequences. Choose between them not by benchmarks, but by 30–50 of your own real requests.
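
Running your own requests through each candidate model can be as simple as the sketch below. It assumes exact-match scoring and a model exposed as a plain Python callable; both are simplifications, the article does not describe its scoring method, and the stub model and test cases are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Iterable, Tuple

# One test case: (block name, prompt, expected answer).
Case = Tuple[str, str, str]


def evaluate(model: Callable[[str], str], cases: Iterable[Case]) -> dict:
    """Return accuracy per block using case-insensitive exact match."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for block, prompt, expected in cases:
        totals[block] += 1
        if model(prompt).strip().lower() == expected.strip().lower():
            correct[block] += 1
    return {block: correct[block] / totals[block] for block in totals}


# Hypothetical stand-in for a real model client.
def stub_model(prompt: str) -> str:
    canned = {
        "Net pay on a gross salary of 150,000 rubles at 13%?": "130500",
        "Expand the abbreviation KIZ.": "part identification code",
    }
    return canned.get(prompt, "")


CASES = [
    ("calculations", "Net pay on a gross salary of 150,000 rubles at 13%?", "130500"),
    ("terminology", "Expand the abbreviation KIZ.", "control identification mark"),
]

scores = evaluate(stub_model, CASES)
```

Reporting accuracy per block, as the article does, keeps a strong result on easy classification from hiding a weak result on calculations or terminology.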

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
