Alibaba Releases Qwen3.5-Omni — Native Multimodal Model for Text, Audio, and Video

Q: What is the source?

Originally published on MarkTechPost. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Apr 30, 2026. Reading time: 3 min.


Hamidun News Editorial

AI monitoring · MarkTechPost



Alibaba presented Qwen3.5-Omni — a native omnimodal model that understands text, images, audio, and video in a single architecture and can respond with voice in real time.

How the model is structured

The central idea behind Qwen3.5-Omni is that it is not a set of separate models bolted onto a text core, but a unified system designed from the start for multiple data types. Alibaba contrasts this with the earlier multimodal approach, in which vision or audio was simply "grafted" onto an LLM through external encoders. For developers, the difference matters: a native architecture typically preserves context better across modalities, links speech to images more accurately, and scales more easily to real-world scenarios such as calls, video analytics, and voice assistants.

The technical report describes Qwen3.5-Omni as an omnimodal model that uses a Hybrid Attention Mixture-of-Experts architecture for both of its components, Thinker and Talker. The Thinker handles understanding and reasoning; the Talker produces streaming voice responses. Qwen states that the model was trained on heterogeneous text-image pairs and more than 100 million hours of audiovisual data. The announced context window is 256 thousand tokens, so a single session can hold very long conversations, meeting recordings, lectures, screenshots, and video clips without being split into dozens of small requests.

The series comes in several variants: Plus, Flash, and Light. This suggests a familiar product-line logic: maximum quality for complex tasks, a fast mode for interactive scenarios, and a lighter version that saves compute. Alibaba separately emphasizes real-time operation: Qwen3.5-Omni can stream responses as text and natural speech, and the ARIA mechanism, which dynamically aligns text and speech units, is responsible for more stable and smoother voice generation.
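The two-stage streaming idea (one stage produces text, the other turns it into speech as it arrives) can be illustrated with a toy sketch. It does not reproduce ARIA or any actual Qwen component: the functions `thinker`, `talker`, and `respond`, the `chunk_size` parameter, and the `<audio:...>` placeholder strings are all hypothetical names invented for this illustration.

```python
from typing import Iterable, Iterator, List


def thinker(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for the reasoning stage: yields text tokens one by one."""
    for token in f"Echo: {prompt}".split():
        yield token


def talker(tokens: Iterable[str], chunk_size: int = 2) -> Iterator[str]:
    """Hypothetical stand-in for the speech stage.

    Emits a placeholder "audio chunk" as soon as chunk_size tokens have
    arrived, instead of waiting for the full text response.
    """
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) == chunk_size:
            yield "<audio:" + " ".join(buffer) + ">"
            buffer = []
    if buffer:  # flush whatever is left at the end of the response
        yield "<audio:" + " ".join(buffer) + ">"


def respond(prompt: str, chunk_size: int = 2) -> List[str]:
    """Chain the two generators so speech chunks are produced lazily."""
    return list(talker(thinker(prompt), chunk_size))


if __name__ == "__main__":
    for chunk in talker(thinker("hello there world")):
        print(chunk)
```

Because both stages are generators, the first audio chunk is available after only `chunk_size` tokens, which is the property that matters for low-latency voice replies.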

Main capabilities of the release

According to the technical report, Qwen3.5-Omni-Plus achieves the best results on 215 tasks and benchmarks covering audio and audiovisual understanding, reasoning, and interaction. Qwen also notes that the model outperforms Gemini 3.1 Pro on key audio tasks and performs at a comparable level in comprehensive audiovisual understanding. For Alibaba, this is an important signal to the market: competition among powerful multimodal models is no longer limited to OpenAI and Google, and Chinese labs are claiming leadership in the most demanding modalities, namely voice, video, and live dialogue.

Context window of 256K tokens
More than 10 hours of audio in one session
More than 400 seconds of 720p video at 1 FPS
Plus, Flash, and Light variants
Structured captions with scenes and timestamps
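The figures in this list allow a rough back-of-the-envelope estimate of how compactly the model must encode media. The sketch below assumes that "256k" means 256,000 tokens and that the audio and video limits each correspond to filling the whole context window with a single modality; neither assumption is confirmed by the source, so the results are only upper bounds on the average token rate.

```python
# Assumption: "256k" is read as 256,000 tokens; the exact figure may differ.
CONTEXT_TOKENS = 256_000
AUDIO_HOURS = 10        # "more than 10 hours of audio in one session"
VIDEO_SECONDS = 400     # "more than 400 seconds of 720p video"
VIDEO_FPS = 1           # sampled at 1 frame per second


def max_tokens_per_audio_second(context_tokens: int, hours: float) -> float:
    """Upper bound on average tokens per second of audio if it must fit the window."""
    return context_tokens / (hours * 3600)


def max_tokens_per_frame(context_tokens: int, seconds: float, fps: float) -> float:
    """Upper bound on average tokens per sampled video frame if it must fit the window."""
    return context_tokens / (seconds * fps)


if __name__ == "__main__":
    audio = max_tokens_per_audio_second(CONTEXT_TOKENS, AUDIO_HOURS)
    video = max_tokens_per_frame(CONTEXT_TOKENS, VIDEO_SECONDS, VIDEO_FPS)
    print(f"audio: at most {audio:.2f} tokens per second")
    print(f"video: at most {video:.0f} tokens per frame")
```

Under these assumptions the audio budget works out to roughly 7 tokens per second and the video budget to 640 tokens per frame. The real per-modality rates are not stated in the article and could be lower.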

Another strong part of the release is audio and video captioning. The report describes structured scene-level captions: the model can build detailed descriptions with precise temporal synchronization and automatic scene segmentation. This is useful not only for media archives, but also for video search, call analytics, training, accessibility scenarios, and content quality control.
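To make "structured scene-level captions" concrete, here is a minimal sketch of how a downstream application might represent and query such output. The schema is an assumption: the article does not specify the model's actual output format, and `SceneCaption`, `format_timestamp`, `scene_at`, and `render` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SceneCaption:
    """Hypothetical record for one automatically segmented scene."""
    start_s: float  # scene start, in seconds from the beginning of the media
    end_s: float    # scene end (exclusive), in seconds
    caption: str    # free-text description of the scene


def format_timestamp(seconds: float) -> str:
    """Render a time offset as MM:SS, dropping fractional seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def scene_at(scenes: List[SceneCaption], t: float) -> Optional[SceneCaption]:
    """Return the scene covering time t, or None if no scene covers it."""
    for scene in scenes:
        if scene.start_s <= t < scene.end_s:
            return scene
    return None


def render(scenes: List[SceneCaption]) -> str:
    """Produce a human-readable, timestamped caption listing."""
    return "\n".join(
        f"[{format_timestamp(s.start_s)}-{format_timestamp(s.end_s)}] {s.caption}"
        for s in scenes
    )


if __name__ == "__main__":
    demo = [
        SceneCaption(0, 12.5, "Speaker greets the audience"),
        SceneCaption(12.5, 75, "Slide with a chart"),
    ]
    print(render(demo))
```

A timestamped structure like this is what makes the use cases above practical: video search becomes a lookup by time or keyword rather than a rescan of the media.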

In essence, Alibaba is positioning Qwen3.5-Omni as a universal understanding layer for any media format, rather than just a "chatbot that also hears." Separately, the researchers note the emergence of a new capability called Audio-Visual Vibe Coding. This means coding directly from audiovisual instructions: the model can interpret not only a text request, but also a spoken explanation together with visual context. For now, this is more of a research signal than a ready mass-market product, but the direction is telling. If such modes take hold, a developer would no longer need to rewrite a bug report as text by hand; they could simply show the interface, describe the problem aloud, and get a working draft of a fix.

What this means

Qwen3.5-Omni shows that the next stage of the AI race is not about yet another text chatbot, but about models that handle sound, images, video, and speech with equal confidence in a single stream. For business, this opens the path to more cohesive products: voice agents, meeting analysis, media search, and interfaces that understand not only text, but everything the user shows and says.

