Alibaba releases a translator with 2.8-second latency across 60 languages

Alibaba has introduced Qwen3.5-LiveTranslate-Flash, a model for simultaneous translation of video and audio. It supports 60 input languages and 29 output languages.

Source: MarkTechPost. Collage: Hamidun News.

Alibaba has released Qwen3.5-LiveTranslate-Flash, a model for simultaneous real-time translation of speech and video. It accepts 60 input languages and produces output in 29 languages, with a latency of 2.8 seconds.

What the new translator can do

The key difference from conventional translators is that Qwen3.5-LiveTranslate-Flash processes video and audio together and keeps the results synchronized. The model sees the speaker on screen, hears their words, and renders them as speech in the target language while preserving the natural sound and emotional tone of the original. This goes beyond plain speech-to-speech translation.

The model analyzes the video stream to synchronize the lip movements of the translated character or avatar, a technique often used when dubbing films and content for streaming services such as Netflix. For now the model is available only as an API through Alibaba Cloud Model Studio. Developers connect over the WebSocket protocol, which keeps a persistent two-way connection open so that data can be streamed in real time with minimal overhead. Commercial use requires an appropriate license from Alibaba.
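To give a sense of what streaming over a WebSocket involves on the client side, here is a minimal sketch in Python. It only shows how raw audio could be cut into short chunks and sent as messages over an already-open connection. The message fields (`type`, `target_language`, `audio`), the 16 kHz 16-bit audio format and the 100 ms chunk size are all assumptions made for illustration, not the documented Model Studio API. A stand-in `FakeSocket` replaces the real connection so the example runs without network access.

```python
import asyncio
import base64
import json
from typing import Iterator, List

SAMPLE_RATE = 16_000      # assumption: 16 kHz mono input
BYTES_PER_SAMPLE = 2      # assumption: 16-bit PCM samples
CHUNK_MS = 100            # assumption: 100 ms of audio per message


def pcm_chunks(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of raw PCM audio."""
    size = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def audio_message(chunk: bytes, target_lang: str) -> str:
    """Wrap one chunk in a JSON text message (hypothetical schema)."""
    return json.dumps({
        "type": "audio_chunk",              # hypothetical field names
        "target_language": target_lang,
        "audio": base64.b64encode(chunk).decode("ascii"),
    })


async def stream_audio(ws, pcm: bytes, target_lang: str) -> int:
    """Send every chunk over an open connection and return the count sent.

    `ws` is any object with an async `send(str)` method, for example a
    connection object from a WebSocket client library.
    """
    sent = 0
    for chunk in pcm_chunks(pcm):
        await ws.send(audio_message(chunk, target_lang))
        sent += 1
    return sent


class FakeSocket:
    """Stand-in connection that records messages instead of sending them."""

    def __init__(self) -> None:
        self.frames: List[str] = []

    async def send(self, frame: str) -> None:
        self.frames.append(frame)


def demo() -> List[str]:
    """Stream one second of silence and return the recorded messages."""
    one_second_of_silence = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)
    sock = FakeSocket()
    asyncio.run(stream_audio(sock, one_second_of_silence, "ja"))
    return sock.frames
```

In a real client, `ws` would come from a WebSocket library, and the endpoint, authentication and message format would have to be taken from the Alibaba Cloud Model Studio documentation. Sending small chunks as they are captured, instead of uploading a finished file, is what makes low-latency translation possible at all.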

Technology with voice cloning

The main innovation in Qwen3.5-LiveTranslate-Flash is dynamic voice cloning during translation. The model picks up the accent, speech rate, intonation and even timbre of the original speaker and reproduces these characteristics in the translation. The result sounds like an interpreter with fluent pronunciation and a feel for the language, not a cold robot. This works thanks to a multimodal architecture in which the neural network simultaneously processes:

  • Audio signal (tone, intonation, pauses, emotions, energy of the speaker)
  • Video stream (lip movements, facial expressions, gestures and body language)
  • Text on screen or in slides (for better understanding of context and technical terms)
  • Customizable keywords (scientific terms, brand names, proper names and abbreviations)

This approach is intended to keep the translation accurate and natural even when the original speaker talks very fast, uses local slang, makes jokes or relies on complex specialized terminology.

How it will be used

On the international benchmarks FLEURS and CoVoST2, Qwen3.5-LiveTranslate-Flash surpassed major commercial solutions from competitors. A response time of 2.8 seconds makes it suitable for simultaneous use: online broadcasts, global conferences, business video calls and corporate presentations. Early versions are already being tested by companies for voice interfaces, intelligent voice assistants and simultaneous content dubbing. Video bloggers will be able to export video with automatic translation and lip-sync, much like a dubbed film.

Streaming platforms will be able to release content in 29 languages within minutes, without post-processing. This is especially interesting for education and science. A professor can give a lecture in Russian, and students in Japan will hear it in Japanese with correct pronunciation and the speaker's own intonation.

What this means for the industry

Simultaneous interpretation is moving from dedicated booths to cloud software. Previously, companies needed simultaneous interpreters with headsets, interpreting booths and special equipment for international broadcasts. Now all of this can be done through an API in minutes.

This is a powerful tool for content globalization. A blogger from Russia can address audiences in Chinese, English and Spanish without an accent and without hiring human translators. Corporate conferences can be run entirely with simultaneous real-time translation, without breaks.

The quality of the result is already competitive with that of salaried professional interpreters. Alibaba positions this model as a business tool, but its potential is much wider, ranging from content accessibility for people with disabilities to cultural exchange between peoples.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.