Rutube Moved from Whisper Pilot to Proprietary Subtitles Platform and Speech Recognition
Rutube demonstrated why simply launching Whisper was insufficient for user video subtitles. Following the pilot, the service had to handle millions of new…
AI-processed from Habr AI; edited by Hamidun News
Rutube described how it launched automatic captions for user-generated video: first through a quick pilot on Whisper, then through its own ASR platform. The team came to this after realizing that recognizing speech in a demo and stably processing an entire stream of content are two very different tasks.
Why Whisper Wasn't Enough
At the start, Whisper was a convenient way to test the hypothesis. It let the team quickly build the first service, push captions to production, and confirm that users really needed the feature. But after launch, limitations surfaced that are hard to spot during a pilot: the platform receives millions of new videos, some lasting up to 24 hours, audio can be noisy, and the language is unknown in advance. On top of that come requirements for text quality and strict limits on processing time, because captions cannot arrive at some arbitrary later moment; they have to be ready on a schedule that matches how the service operates.
Between "recognize speech" and "provide captions for all content" lies an enormous amount of work.
This gap was precisely the team's main takeaway.
For user-generated video, it's not enough to simply run the audio track through a ready-made model and save the result. You need all the infrastructure around recognition: handling long files, robustness to poor audio, text quality control, queue management, and predictable performance under heavy load. Otherwise, even a good ASR model becomes a bottleneck that can't handle industrial-scale traffic.
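One piece of that infrastructure, handling long files, can be illustrated with a minimal sketch. The source does not describe how Rutube splits audio, so the fixed window size, the overlap, and the names `Chunk` and `plan_chunks` are assumptions made for illustration only. The idea is to turn one very long recording into many short, slightly overlapping windows that can be recognized independently and stitched back together by timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int
    start_s: float  # window start, seconds from the beginning of the audio
    end_s: float    # window end, seconds from the beginning of the audio


def plan_chunks(duration_s: float, chunk_s: float = 30.0,
                overlap_s: float = 2.0) -> list[Chunk]:
    """Plan fixed-size windows that overlap slightly (hypothetical defaults).

    The overlap gives neighbouring windows shared context so that a word
    cut at a boundary appears whole in at least one of them.
    """
    if duration_s <= 0:
        return []
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be non-negative and smaller than chunk_s")

    step = chunk_s - overlap_s
    chunks: list[Chunk] = []
    start = 0.0
    while True:
        end = min(start + chunk_s, duration_s)
        chunks.append(Chunk(len(chunks), start, end))
        if end >= duration_s:  # the last window reaches the end of the audio
            break
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in plan_chunks(70.0):
        print(chunk)
```

With these assumed defaults, a 24-hour video becomes roughly three thousand small jobs instead of one huge one, which is what makes it possible to spread the work across workers and retry a failed piece without reprocessing the whole file.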
What the System Became
In the end, the task stopped being "another ASR-based service" and became a full-fledged subtitling platform. Rutube writes that to achieve this, they had to transition to a microservices architecture and their own speech recognition system. This approach was necessary not for the sake of trendy tech stacks, but for the sake of separation of concerns: one part of the system handles video intake and preparation, another handles recognition itself, and a third handles assembly and delivery of results. At scale, this is critical because it allows you to scale individual components independently and prevents the entire pipeline from breaking due to overload in one place.
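The separation of concerns described above can be sketched as three stages connected by bounded queues. This is only an in-process illustration under stated assumptions: threads stand in for separate services, `queue.Queue` stands in for whatever message broker a real deployment would use, and the functions `prepare`, `recognize`, and `assemble` are hypothetical placeholders, not Rutube's components.

```python
import queue
import threading

STOP = object()  # sentinel telling a stage that no more work is coming


def run_stage(handler, inbox, outbox):
    """Pull items from inbox, apply handler, push results to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # pass the shutdown signal downstream
            return
        outbox.put(handler(item))


# Hypothetical stand-ins for the three responsibilities named in the text.
def prepare(video_id):
    return {"video_id": video_id, "audio": f"audio-for-{video_id}"}


def recognize(job):
    return {**job, "text": f"transcript of {job['audio']}"}


def assemble(job):
    return {"video_id": job["video_id"], "subtitles": job["text"]}


def run_pipeline(video_ids, max_pending=8):
    # Bounded queues give backpressure: a slow stage blocks its producer
    # instead of letting unprocessed work pile up without limit.
    q_in = queue.Queue(maxsize=max_pending)
    q_prepared = queue.Queue(maxsize=max_pending)
    q_recognized = queue.Queue(maxsize=max_pending)
    q_out = queue.Queue()  # unbounded, because it is drained only at the end

    stages = [
        threading.Thread(target=run_stage, args=(prepare, q_in, q_prepared)),
        threading.Thread(target=run_stage, args=(recognize, q_prepared, q_recognized)),
        threading.Thread(target=run_stage, args=(assemble, q_recognized, q_out)),
    ]
    for stage in stages:
        stage.start()
    for video_id in video_ids:
        q_in.put(video_id)
    q_in.put(STOP)
    for stage in stages:
        stage.join()

    results = []
    while (item := q_out.get()) is not STOP:
        results.append(item)
    return results


if __name__ == "__main__":
    for result in run_pipeline(["video-1", "video-2"]):
        print(result)
```

Because each stage only talks to its neighbours through a queue, the recognition stage could be given more workers than intake or assembly without touching the other two, which is the independent scaling the article refers to.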
For such a platform, several requirements are important at once:
- Accept a stream of millions of new videos without manual intervention
- Process videos up to 24 hours long without pipeline collapse
- Work with unknown languages and noisy user-generated audio
- Maintain text quality sufficient for publication
- Stay within speed and processing cost limits
The transition to in-house ASR makes sense in this context. When a product runs on mass UGC, a general-purpose external model helps you get started but is hard to tune to real data, infrastructure constraints, and target metrics. Your own system gives more control over speed, quality, resources, and how recognition behaves on edge cases, which for a video platform are the norm rather than the exception.
How They Achieved Speed
The most striking number in Rutube's story is a throughput of around 1,200 videos per hour per ASR instance. This is an important benchmark because in production, recognition quality cannot be judged separately from throughput. If the system produces good text but leaves thousands of videos waiting in a queue, the user gets little benefit. If the pipeline runs fast but is unstable on long videos or poor audio, the product breaks down in real operation. So architecture here is just as important as the model itself.
Behind this number stands not a single successful algorithm, but a series of engineering solutions: how to slice and feed audio, how to distribute tasks, how to avoid wasting time on inefficient stages, and how to keep resources under control. The economic aspect is also important. The higher the throughput per ASR instance, the easier it is to scale the service without explosive infrastructure cost growth. For platforms with a constant stream of UGC, this is no longer a matter of convenience but basic product economics.
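The link between per-instance throughput and infrastructure cost can be made concrete with a back-of-the-envelope calculation. The only figure taken from the source is roughly 1,200 videos per hour per ASR instance; the daily volume of one million videos and the `peak_factor` headroom parameter below are hypothetical inputs chosen for illustration, and the sketch treats the throughput figure as an average that ignores differences in video length.

```python
import math


def instances_needed(videos_per_day: float,
                     videos_per_hour_per_instance: float = 1200,
                     peak_factor: float = 1.0) -> int:
    """Estimate how many ASR instances keep up with a daily upload volume.

    peak_factor (hypothetical) adds headroom for uneven load, for example
    2.0 if the busiest hours see twice the average upload rate.
    """
    if videos_per_day < 0 or videos_per_hour_per_instance <= 0 or peak_factor < 1.0:
        raise ValueError("invalid capacity parameters")
    per_instance_per_day = videos_per_hour_per_instance * 24
    return math.ceil(videos_per_day * peak_factor / per_instance_per_day)


if __name__ == "__main__":
    daily_volume = 1_000_000  # hypothetical, not a figure from the source
    print(instances_needed(daily_volume))                    # average load
    print(instances_needed(daily_volume, peak_factor=2.0))   # with 2x headroom
    print(instances_needed(daily_volume, videos_per_hour_per_instance=600))
```

At 1,200 videos per hour one instance handles 28,800 videos per day, so the hypothetical million-video day needs 35 instances at average load. Halving per-instance throughput to 600 raises that to 70, which shows why throughput per instance translates directly into infrastructure cost.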
What This Means
Rutube's story clearly illustrates the boundary between a quick AI prototype and a mature product. A ready-made model like Whisper helps you launch quickly, but a mass-scale service requires its own architecture, quality control, and optimization for real-world loads. For anyone building AI features on top of user-generated content, the signal is clear: the bottleneck usually lies not in a single model but in the entire pipeline around it.