Yandex SpeechKit and CosyVoice compared on voice bot and audio podcast tasks

Q: What is the source?

Originally published on Habr AI. Hamidun News processes and adapts the material with AI.

Q: When was it published?

May 2, 2026. Reading time: 4 min.

Hamidun News Editorial

Raft has released the second part of its review of TTS models, this time covering proprietary services as well as open-source solutions. The focus is on two practical scenarios: a voice bot that responds in real time and narration of long texts for audio podcasts.

How they compared

The author kept the same evaluation framework as in the first part of the review so that the results are directly comparable. Two models were tested: CosyVoice 3-0.5B from Alibaba and Yandex SpeechKit. Rather than abstract demos, they were evaluated on tasks where a business cares not only about voice quality but also about latency, stability, controllability and ease of implementation. That makes the comparison useful for choosing a tool for a specific product, not just as a research exercise. The models were scored on four criteria:

Latency of generation on CPU and GPU
Speech naturalness: timbre, smoothness, tempo and intonation
Expressiveness: emotions and context adaptation
Integration ease: documentation, launch and configuration
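The first criterion, generation latency, is the easiest one to reproduce. A minimal sketch of how such a measurement could be set up is shown here; it is not the review author's harness. The `fake_synthesize` function is a hypothetical stand-in for a real TTS call, and the test phrase is made up for illustration.

```python
import statistics
import time
from typing import Callable, List


def measure_latency(synthesize: Callable[[str], bytes], text: str, runs: int = 5) -> List[float]:
    """Return wall-clock seconds for each of `runs` calls to synthesize(text)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)  # the call under test; its output is discarded here
        timings.append(time.perf_counter() - start)
    return timings


def fake_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for a real TTS engine: sleeps briefly, returns dummy bytes."""
    time.sleep(0.01)
    return b"\x00" * len(text)


if __name__ == "__main__":
    # Made-up phrase containing one of the abbreviations mentioned in the review.
    timings = measure_latency(fake_synthesize, "Запись к врачу по полису ОМС.", runs=3)
    print(f"median latency: {statistics.median(timings):.3f} s")
```

Taking the median over several runs smooths out one-off delays such as model warm-up or a slow network round trip.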

For the voice bot scenario, the models were run through a short medical dialogue with difficult Russian abbreviations such as ОМС, СНИЛС, ИБС, ЭКГ and ЭХО-КГ. For the podcast scenario, the author used a literary fragment from the story "A Gentleman from San Francisco" (4,868 characters, 728 words). A test like this quickly exposes typical TTS problems: misplaced stress, intonation failures, unnatural pauses and artifacts that become especially noticeable in long passages.

CosyVoice in action

In this review CosyVoice serves as a strong open-source candidate for the Russian language. The author tested version 3-0.5B and, for local deployment, used FastCosyVoice, an improved Russian-language fork.

In the voice assistant scenario, the model pronounced the medical abbreviations confidently, showed no noticeable accent and sounded natural overall. For teams that want to keep the TTS pipeline inside their own infrastructure and not depend on an external API, this is an important advantage. On speed, the result was a compromise, though a predictable one for a local model.

On a short test phrase lasting about 10-15 seconds, CosyVoice showed latency of 12.25 seconds on CPU and 3.49 seconds on GPU.
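To put these figures in context, latency can be related to the length of the audio produced. The sketch below computes a real-time factor (generation time divided by audio duration). It assumes the reported latency is the total generation time for the phrase, which the text does not state explicitly, and takes the 10-15 second duration range from the description above; the function and constant names are illustrative.

```python
from typing import Tuple

# Latencies reported in the review, in seconds.
MEASURED_LATENCY_S = {"CPU": 12.25, "GPU": 3.49}

# The phrase is described as lasting "about 10-15 seconds".
AUDIO_RANGE_S = (10.0, 15.0)


def real_time_factor(latency_s: float, audio_s: float) -> float:
    """Generation time divided by audio duration; below 1.0 means faster than real time."""
    return latency_s / audio_s


def rtf_range(latency_s: float) -> Tuple[float, float]:
    """Best and worst case RTF over the assumed audio duration range."""
    shortest, longest = AUDIO_RANGE_S
    return real_time_factor(latency_s, longest), real_time_factor(latency_s, shortest)


if __name__ == "__main__":
    for device, latency in MEASURED_LATENCY_S.items():
        best, worst = rtf_range(latency)
        print(f"{device}: RTF between {best:.2f} and {worst:.2f}")
```

Under those assumptions the CPU run lands at roughly 0.8 to 1.2, that is, around real time, while the GPU run is roughly 0.23 to 0.35.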

For production this means that without a decent graphics card it will be hard to count on a quick response. On subjective scores, however, the model received 5 for naturalness and 5 for expressiveness, which is a strong argument for tasks where the voice should sound alive rather than like a classic auto-responder. CosyVoice also handled long text confidently: the speech was clean, coherent and fairly close to the voice of the reference speaker.

There were caveats, though: in places the stress was inaccurate and the intonation occasionally went wrong. For audio podcasts this is not a critical drawback, but it does mean that before the narration is published a person still needs to review the result and correct the doubtful passages.

"...show how these solutions behave in real product scenarios".

SpeechKit's strengths

In the review, Yandex SpeechKit comes across as the more mature production tool. The service has clear documentation, a large set of Russian voices, several voice characters and scenarios designed for quick integration. In the voice bot test it handled phrases with abbreviations just as confidently, and on long text it delivered more stable pronunciation and stress.

The main trade-off lies elsewhere: judged on naturalness alone, the voice sounds slightly more robotic than the best modern TTS. For a business, something else matters more: SpeechKit already covers almost the entire applied workflow around speech synthesis and reduces the amount of manual refinement after integration. It is not just a synthesis engine but a set of service capabilities that are especially valuable in bots, call centers and any scenario where new voice flows must be launched quickly without lengthy engineering setup:

Synchronous, asynchronous and streaming synthesis
Realtime API for voice bots
STT and TTS bundled in one interface
Tools for marking stress, pauses and phonetics
Ability to create your own voice from labeled recordings

On a short phrase, SpeechKit showed a latency of 1.81 seconds and received 4 for naturalness, 5 for expressiveness and 5 for integration ease. The review also looks at pricing separately: API v1 costs 1342 rubles per 1 million characters per month, while API v3 bills requests in blocks of 250 characters. An example from the article: synthesizing 900 characters in v3 costs approximately 0.65 rubles. For teams this is a convenient model because the cost of the voice channel and the server load can be calculated in advance, before full launch.
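The arithmetic behind these prices can be sketched as follows. The v1 rate and the 250-character block size come from the text. The per-block price for v3 is not stated; here it is derived from the 900-character example, assuming those 900 characters are billed as 4 blocks rounded up. It is an assumption for illustration, not an official tariff, and all names are hypothetical.

```python
V1_RUB_PER_MILLION_CHARS = 1342.0  # from the review
V3_BLOCK_CHARS = 250               # from the review

# ASSUMPTION: derived from the article's example (900 chars ~ 0.65 rub, billed as 4 blocks).
V3_RUB_PER_BLOCK_ASSUMED = 0.65 / 4


def v1_cost(chars: int) -> float:
    """Cost in rubles under per-character v1 pricing."""
    return chars * V1_RUB_PER_MILLION_CHARS / 1_000_000


def v3_blocks(chars: int) -> int:
    """Number of 250-character blocks, rounded up (integer ceiling division)."""
    return (chars + V3_BLOCK_CHARS - 1) // V3_BLOCK_CHARS


def v3_cost(chars: int) -> float:
    """Cost in rubles under block pricing, using the assumed per-block price."""
    return v3_blocks(chars) * V3_RUB_PER_BLOCK_ASSUMED


if __name__ == "__main__":
    for chars in (900, 4868):
        print(f"{chars} chars: v1 ~ {v1_cost(chars):.2f} rub, "
              f"v3 = {v3_blocks(chars)} blocks ~ {v3_cost(chars):.2f} rub")
```

With these assumptions, the 4,868-character podcast fragment would come to about 6.53 rubles under v1 and 20 blocks, or about 3.25 rubles, under v3. Actual bills depend on the current tariff and on how text is split into requests.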

What this means

The comparison paints a fairly clear picture of the Russian-language TTS market. If a team cares about control over its stack, open-source licensing and a livelier sound, CosyVoice looks like a strong option, especially when a GPU is available and the team is willing to manage the infrastructure. If the priority is a quick launch, predictable integration and ready-made tools for a call center or voice assistant, Yandex SpeechKit looks more practical. It now makes sense to choose a TTS system not by abstract quality but by how the model behaves in a specific product and under a specific load.
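One way to make that choice concrete is to filter candidates by a latency budget and then rank whatever remains by subjective quality. The sketch below uses the latencies and scores reported in the review; the selection rule itself and the budget values are hypothetical and are not taken from the original article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    latency_s: float      # short-phrase latency reported in the review
    naturalness: int      # subjective score from the review
    expressiveness: int   # subjective score from the review


CANDIDATES = [
    Candidate("CosyVoice (GPU)", 3.49, 5, 5),
    Candidate("CosyVoice (CPU)", 12.25, 5, 5),
    Candidate("Yandex SpeechKit", 1.81, 4, 5),
]


def pick(candidates: List[Candidate], latency_budget_s: Optional[float]) -> Optional[Candidate]:
    """Keep candidates within the budget, then prefer higher quality, then lower latency."""
    eligible = [
        c for c in candidates
        if latency_budget_s is None or c.latency_s <= latency_budget_s
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: (c.naturalness + c.expressiveness, -c.latency_s))


if __name__ == "__main__":
    # Hypothetical budgets: a strict one for a realtime bot, none for offline narration.
    print("voice bot:", pick(CANDIDATES, latency_budget_s=2.0))
    print("podcast:", pick(CANDIDATES, latency_budget_s=None))
```

A real decision would also weigh the factors the review discusses but this sketch leaves out, such as integration effort, pricing and the need for manual review of long narration.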
