
Ant Group unveils Ming-flash-omni 2.0: an open multimodal breakthrough

Ant Group introduced Ming-flash-omni 2.0, a powerful open-source multimodal model. The new release leads in visual understanding and content generation, outperf

AI-processed from 36Kr (36氪); edited by Hamidun News

Ant Group has open-sourced the most ambitious multimodal model in its portfolio: Ming-flash-omni 2.0. The company claims that the model not only rivals Google's Gemini 2.5 Pro but also surpasses it on several key benchmarks. The main distinction, however, lies elsewhere: Ming-flash-omni 2.0 is the first in the industry to generate audio synchronously, producing speech, background noise, and music together in a single track. This is more than a technical achievement; it marks a move to a new level of multimedia work.

The release of an open multimodal model by a Chinese fintech giant appears to be part of a broader strategy. While Western market leaders such as OpenAI, Google, and Anthropic keep their most powerful systems closed, companies like Ant Group are coming to see openness as a competitive advantage. Released as open source, Ming-flash-omni 2.0 gains immediate access to a developer community that can adapt the model to local needs, optimize it for their own devices, and build specialized applications. This matters especially in Asian markets, where localization and cultural adaptation are critical.

On the technical side, the model's results are impressive. In tests of vision-language understanding and of image generation and editing, Ming-flash-omni 2.0 posts results that compete with Gemini 2.5 Pro and, on certain benchmarks, surpass it. But images and text are already familiar territory for modern large models. The real innovation is in audio. Until now, developers synthesizing speech generated the voice on its own and then added background sounds and music as separate layers in post-production. Ming-flash-omni 2.0 changes this: it can create all three components at once, taking context into account so that they interact naturally in a single timeline.
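To make the contrast concrete, the conventional layered workflow can be sketched roughly as follows. This is an illustration only: sine waves stand in for separately generated speech, ambience, and music, and the names `tone`, `mix_layers`, and `SAMPLE_RATE` are hypothetical. The sketch does not use or represent Ming-flash-omni 2.0's actual interface.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed value for illustration)


def tone(freq_hz, seconds, amplitude=1.0):
    """Stand-in for a separately generated stem: a plain sine wave."""
    n = int(SAMPLE_RATE * seconds)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
        for i in range(n)
    ]


def mix_layers(layers, gains):
    """Sum independently produced stems into one track, sample by sample."""
    if len(layers) != len(gains):
        raise ValueError("one gain per layer is required")
    length = max(len(layer) for layer in layers)
    mixed = [0.0] * length
    for layer, gain in zip(layers, gains):
        for i, sample in enumerate(layer):
            mixed[i] += gain * sample
    # Hard-limit every sample to the valid range [-1.0, 1.0].
    return [max(-1.0, min(1.0, s)) for s in mixed]


speech = tone(220.0, 1.0)        # placeholder for text-to-speech output
ambience = tone(90.0, 1.5, 0.5)  # placeholder for background noise
music = tone(440.0, 2.0, 0.5)    # placeholder for a music bed

# Post-production step: the three stems are combined only at the very end.
track = mix_layers([speech, ambience, music], gains=[0.6, 0.2, 0.2])
```

In this workflow each stem is produced without any knowledge of the others, and the balance between them is set by hand through the gains. Generating all three components together, as the article describes for Ming-flash-omni 2.0, is meant to make this manual step unnecessary.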

This opens the door to entirely new use cases. In media production, it can speed up voiceover work for video content, documentaries, and podcasts: the system can generate not only a narrator's speech but also a soundscape enriched with atmospheric detail. In AI application development, unified audio generation makes it possible to build more complex interactive systems, from smart assistants that sound like real people in real environments to game scenarios with full sound design created on the fly.

The open-source release of Ming-flash-omni 2.0 signals a shift in the geopolitics of AI. Innovation in multimodality was previously dominated by Western giants, but Chinese companies are now showing that they can not only keep pace but also pull ahead in specific areas. Open access will amplify this effect by letting developers worldwide experiment with the model and improve it. The only question is whether the Western industry can adapt quickly to a new reality in which the best tools are often open and available to everyone, not just to those who can afford top-tier cloud computing from major providers.

