Yandex SpeechKit and CosyVoice compared on voice bot and audio podcast tasks
Raft released the second part of its TTS model review and compared CosyVoice with Yandex SpeechKit in two business scenarios: a realtime bot and long-form…
AI-processed from Habr AI; edited by Hamidun News
Raft has released the second part of its TTS model review, this time comparing not only open-source solutions but also proprietary services. The focus is on two practical scenarios: a voice bot that responds in real time and narration of long texts for audio podcasts.
How they compared
The author kept the same evaluation framework as in the first part of the review so that the results can be compared directly. Two models were tested: CosyVoice 3-0.5B from Alibaba and Yandex SpeechKit. They were evaluated not on abstract demos but on tasks where a business cares about latency, stability, controllability and ease of implementation as well as voice quality. This makes the comparison useful less as a research exercise and more as a guide to choosing a specific tool for a product. The evaluation criteria were:
- Generation latency on CPU and GPU
- Speech naturalness: timbre, smoothness, tempo and intonation
- Expressiveness: emotions and adaptation to context
- Ease of integration: documentation, launch and configuration
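The review does not describe how latency was measured. As a minimal sketch, assuming latency means the wall-clock time of one full synthesis call, a harness along these lines could produce comparable numbers for any engine. `measure_latency` and `fake_tts` are hypothetical names introduced for illustration; `fake_tts` is a stand-in, not a real CosyVoice or SpeechKit call.

```python
import statistics
import time
from typing import Callable


def measure_latency(synthesize: Callable[[str], bytes], text: str, runs: int = 5) -> float:
    """Return the median wall-clock time in seconds of synthesize(text) over several runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)  # the audio itself is discarded; only the timing matters here
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def fake_tts(text: str) -> bytes:
    """Hypothetical stand-in for a real TTS call: sleeps 1 ms per input character."""
    time.sleep(0.001 * len(text))
    return b""


latency = measure_latency(fake_tts, "Test phrase", runs=3)
print(f"median latency: {latency:.3f} s")
```

Taking the median rather than a single run reduces the effect of warm-up and one-off stalls, which matters when comparing a local model with a network API.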
For the voice bot scenario, the models were run through a short medical dialogue containing difficult Russian abbreviations such as ОМС, СНИЛС, ИБС, ЭКГ and ЭХО-КГ. For the podcast scenario, the test text was a literary fragment from the story "A Gentleman from San Francisco", 4868 characters and 728 words long. A test like this quickly exposes typical TTS problems: misplaced stress, intonation failures, unnatural pauses and artifacts that become especially noticeable over long passages.
CosyVoice in action
In this review, CosyVoice serves as a strong open-source candidate for the Russian language. The author tested version 3-0.5B and, for local deployment, used FastCosyVoice, an improved Russian-language fork.
In the voice assistant scenario, the model pronounced medical abbreviations confidently, showed no noticeable accent and sounded natural overall. For teams that want to keep the TTS pipeline within their own infrastructure and not depend on an external API, this is a significant advantage. On speed, the result was a compromise, but a predictable one for a local model.
On a short test phrase lasting about 10-15 seconds, CosyVoice showed latency of 12.25 seconds on CPU and 3.49 seconds on GPU.
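To put these numbers in context, they can be converted into a real-time factor (RTF), the ratio of generation time to audio duration. This is a rough sketch, assuming the reported latency is the total time to generate the whole phrase (the review does not say whether it is total time or time to first audio) and using the stated 10-15 second range as bounds on duration.

```python
def real_time_factor(latency_s: float, audio_s: float) -> float:
    """RTF = generation time / audio duration; below 1.0 means faster than real time."""
    return latency_s / audio_s


# Latencies reported in the review for CosyVoice on a short phrase.
LATENCY_S = {"CPU": 12.25, "GPU": 3.49}
# The phrase is described as lasting about 10-15 seconds.
AUDIO_RANGE_S = (10.0, 15.0)

rtf_ranges = {}
for device, latency in LATENCY_S.items():
    best = real_time_factor(latency, AUDIO_RANGE_S[1])   # longest audio gives the lowest RTF
    worst = real_time_factor(latency, AUDIO_RANGE_S[0])  # shortest audio gives the highest RTF
    rtf_ranges[device] = (best, worst)
    print(f"{device}: RTF between {best:.2f} and {worst:.2f}")
```

Under these assumptions the CPU run sits at roughly 0.8 to 1.2, that is, around real time, while the GPU run stays well below 1 (about 0.23 to 0.35).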
For production, this means that without a decent graphics card it will be hard to count on a quick response. On subjective scores, however, the model received 5 points for naturalness and 5 for expressiveness, which is a strong argument for tasks where the voice should sound alive rather than like a classic auto-responder. When generating long text, CosyVoice also performed confidently: the speech came out clean, coherent and fairly close to the voice of the reference speaker.
It was not entirely without caveats: in places there were misplaced stresses and occasional intonation errors. For audio podcasts this is not a critical drawback, but it does mean that before the narration is published, someone still needs to review the result and correct the questionable passages.
"...show how these solutions behave in real product scenarios".
SpeechKit's strengths
In the review, Yandex SpeechKit looks like the more mature production tool. The service has clear documentation, a large set of Russian voices, several voice characters and scenarios designed for quick integration. In the voice bot test, it handled phrases with abbreviations just as confidently, and on long text it delivered more stable pronunciation and stress.
The main trade-off lies elsewhere: the voice sounds slightly more robotic than the best modern TTS, especially when judged on naturalness alone. For a business, something else matters more: SpeechKit already covers almost the entire applied pipeline around speech synthesis and reduces the amount of manual refinement needed after integration. It is not just a synthesis engine but a set of service capabilities that are especially valuable in bots, call centers and any scenario where new voice flows need to be launched quickly without lengthy engineering setup:
- synchronous, asynchronous and streaming synthesis
- Realtime API for voice bots
- bundle of STT and TTS in one interface
- tools for marking stresses, pauses and phonetics
- ability to create your own voice from marked recordings
On a short phrase, SpeechKit showed a latency of 1.81 seconds and received 4 points for naturalness, 5 for expressiveness and 5 for ease of integration. The review also looks at pricing separately: API v1 costs 1342 rubles per 1 million characters per month, while API v3 bills requests in blocks of 250 characters. An example from the article: synthesizing 900 characters in v3 costs approximately 0.65 rubles. For teams this is a convenient model, because the cost of the voice channel and the server load can be calculated in advance, before a full launch.
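The pricing arithmetic above can be sketched as a small estimator. The v1 rate and the 250-character block size come from the review. The per-block price is an assumption back-calculated from the article's own example (900 characters is 4 blocks, about 0.65 rubles), and rounding a partial block up to a full one is also an assumption. Neither should be read as an official tariff.

```python
import math

V1_RUB_PER_MILLION_CHARS = 1342.0  # rate stated in the review
V3_BLOCK_CHARS = 250               # block size stated in the review
# ASSUMPTION: derived from the review's example (900 chars, about 0.65 RUB, 4 blocks),
# not taken from an official price list.
V3_RUB_PER_BLOCK_ASSUMED = 0.65 / 4


def v1_cost_rub(chars: int) -> float:
    """Cost under per-character billing at the v1 rate."""
    return chars * V1_RUB_PER_MILLION_CHARS / 1_000_000


def v3_blocks(chars: int) -> int:
    """Number of billed blocks, assuming a partial block is rounded up."""
    return math.ceil(chars / V3_BLOCK_CHARS)


def v3_cost_rub(chars: int, rub_per_block: float = V3_RUB_PER_BLOCK_ASSUMED) -> float:
    """Cost of one request under block-based billing."""
    return v3_blocks(chars) * rub_per_block


request_chars = 900
print(f"v3: {v3_blocks(request_chars)} blocks, about {v3_cost_rub(request_chars):.2f} RUB")
```

Because billing is per block, a 751-character request would cost the same as a 1000-character one under these assumptions, which is worth keeping in mind when splitting long texts into requests.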
What this means
The comparison paints a fairly clear picture of the Russian-language TTS market. If a team cares about control over its stack, open-source licensing and livelier sound, CosyVoice looks like a strong option, especially with a GPU available and a willingness to manage the infrastructure. If you need a quick launch, predictable integration and ready-made tools for a call center or voice assistant, Yandex SpeechKit looks more practical. It now makes sense to choose a TTS not on abstract quality but on how the model behaves in a specific product under a specific load.