
WisprFlow, Whisper and GigaAM: who recognizes Russian-English speech better

Voice input for neural network commands and code work is constrained not by speed, but by the ability to understand Russian-English code-switching. A new…

AI-processed from Habr AI; edited by Hamidun News

Voice input is no longer just a convenience layer: for people who talk to LLMs, work in Cursor, and dictate commands that mix Russian and English, it is becoming a full-fledged interface. In a new review, the author compares applications and models that are supposed to handle phrases like "explain in Russian," "open in Cursor," and "check that deploy passed," and shows which solutions actually cope with such mixed speech in 2026. The material draws on six months of hands-on testing.

The focus is not on abstract per-language recognition accuracy but on a harder scenario familiar to developers, analysts, and active AI users: rapid switching between Russian and English within a single phrase, accurate rendering of product names, technical terms, and code elements, and clean punctuation without lengthy post-processing. This is where even strong systems often break down: English words come out in Cyrillic, commands lose their meaning, and the dictated text needs manual editing.

On the application side, the author compared five options from different categories: WisprFlow, SpeakFlow, Handy, OpenWhispr, and SuperWhisper. The selection includes cloud and local solutions, paid products and open source tools. One of the main conclusions of the review is that the cloud-based WisprFlow can already be replaced with a free open source alternative without a noticeable loss of quality. For the user, this means not just savings on a subscription but also greater control over privacy, performance, and the settings of the local pipeline.

The author also notes his own contribution to the ecosystem: one of his pull requests was accepted into the main branch of an open source project.

The section on models proved equally important. The benchmark included Whisper Large v3, Whisper Turbo, GigaAM v3 from Sber, Canary 1B v2 from NVIDIA, and Parakeet V3. Whisper remains the baseline for such comparisons, but the article shows that the actual result depends not only on the model itself, but also on how it is run. The author separately compared Whisper Turbo and Large v3 on an RTX 5070 Ti and got an unexpected result: on the Blackwell architecture, running through Vulkan was approximately 50% faster than through CUDA. For a local scenario, this is an important practical detail, because the difference directly affects latency, voice input smoothness, and the overall choice of stack.
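A backend comparison of this kind can be reproduced with a small timing harness. The sketch below is not the author's benchmark: `median_latency` and `percent_faster` are hypothetical helpers, the `transcribe` argument stands in for whatever function wraps a given runtime, and it assumes "50% faster" means 1.5x the throughput, which the review does not spell out.

```python
import statistics
import time
from typing import Callable


def median_latency(transcribe: Callable[[str], str], audio_path: str, runs: int = 5) -> float:
    """Median wall-clock seconds for one transcription call."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio_path)  # the backend under test (hypothetical wrapper)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def percent_faster(baseline_s: float, candidate_s: float) -> float:
    """How much faster the candidate is, read as a throughput gain in percent."""
    return (baseline_s / candidate_s - 1.0) * 100.0


# Placeholder timings, not measurements from the review:
# a 3.0 s baseline against a 2.0 s candidate is a 50% throughput gain.
gain = percent_faster(3.0, 2.0)
```

Using the median rather than the mean keeps a single slow warm-up run from skewing the comparison.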

Whisper alternatives no longer look like pure experiments either. According to the author's observations, GigaAM v3 and Canary 1B v2 do approach the leader's level in a number of scenarios, but their weak points show up in mixed speech, when an English word has to be preserved as is rather than translated or transliterated. A telling example from the review is Gemini turning into Jemni.

For an ordinary note this is unpleasant but tolerable; for voice work with AI tools, IDEs, library names, and deployment commands, such an error can destroy the meaning entirely. That is why, in technical use, the quality of code-switching handling matters more than an averaged accuracy metric.
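The difference between averaged accuracy and code-switching quality can be made measurable by checking whether the expected English terms survive in the transcript verbatim. The following is a minimal sketch, not a metric from the review: `term_preservation_rate` is a hypothetical helper, and the sample sentences are invented for illustration.

```python
import re


def term_preservation_rate(transcript: str, expected_terms: list[str]) -> float:
    """Share of expected Latin-script terms found verbatim (case-insensitive)."""
    if not expected_terms:
        return 1.0
    lowered = transcript.lower()
    hits = 0
    for term in expected_terms:
        # Require non-alphanumeric Latin boundaries so "git" does not match inside "github".
        pattern = r"(?<![a-z0-9])" + re.escape(term.lower()) + r"(?![a-z0-9])"
        if re.search(pattern, lowered):
            hits += 1
    return hits / len(expected_terms)


terms = ["Gemini", "Cursor"]
# Invented examples: one clean transcript and one with the "Jemni" distortion.
clean = term_preservation_rate("Спроси у Gemini, как открыть проект в Cursor", terms)
garbled = term_preservation_rate("Спроси у Jemni, как открыть проект в Cursor", terms)
```

A word error rate would count "Jemni" as one wrong word among many, while this check treats it as a lost term, which is closer to how the error feels when dictating to an IDE or an LLM.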

The author also reports that missing commas and periods were fixed in 99% of cases with a single text prompt, without LLM post-processors and the extra delay they add. This matters for anyone building a voice workflow around editors, AI chat applications, and notes: the inconvenience often comes not from misrecognized words but from the need to clean up the text afterwards through separate processing layers. If punctuation can be stabilized at the recognition stage, voice genuinely starts to compete with the keyboard not only in speed but also in everyday convenience.
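The review summary does not quote the prompt or say where it is passed. One plausible mechanism, assumed here, is the `initial_prompt` argument of the open source `whisper` package's `transcribe` method, which conditions the decoder on preceding text. The helper name, the prompt wording, and the term list below are hypothetical.

```python
def build_transcribe_options(terms: list[str]) -> dict:
    """Assemble keyword arguments for a Whisper-style transcribe call."""
    # A well-punctuated, mixed-language sentence nudges the decoder toward
    # the same style and shows the expected spelling of English terms.
    prompt = "Привет, это диктовка с пунктуацией. Термины: " + ", ".join(terms) + "."
    return {"initial_prompt": prompt}


options = build_transcribe_options(["Cursor", "deploy", "Gemini"])

# Assumed usage with the openai-whisper package (not executed here):
#   import whisper
#   model = whisper.load_model("large-v3")
#   result = model.transcribe("note.wav", **options)
#   print(result["text"])
```

Because the prompt is consumed during decoding, this approach adds no separate post-processing pass, which matches the "no additional delay" point above.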

The conclusion of the review is simple: by April 2026, voice input tools for mixed Russian-English speech have matured noticeably, but there is still no universal winner. If maximum predictability is needed, Whisper and the strong applications built around it still set the standard. If locality, price, and control over the stack matter, open source solutions already look like a real alternative to cloud services.

The main criterion is no longer the marketing "accuracy" figure but the system's ability to handle live technical speech smoothly, where Russian, English, and commands for neural networks all occur in a single sentence.

