Microsoft Introduces Three Models for Text, Voice, and Image Processing

Q: What is the source?

Originally published on 3DNews AI. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Apr 28, 2026. Reading time: 3 min.

Hamidun News Editorial

Microsoft AI is expanding its own line of generative services and signaling that it no longer wants to rely solely on partner models. The company's research division has presented three new models at once: MAI-Transcribe-1 for speech-to-text conversion, MAI-Voice-1 for voice synthesis, and MAI-Image-2 for generating images from text descriptions. For Microsoft, this is not just another launch but a bid for a more independent role in the AI platform race. The new lineup covers several key scenarios that are in demand in corporate products and cloud services.

MAI-Transcribe-1 converts speech to text in 25 languages and, according to Microsoft, runs 2.5 times faster than Microsoft's Azure Fast service. That matters for call centers, meeting transcription, customer conversation analytics, and real-time content localization. MAI-Voice-1 generates roughly one minute of audio in about one second and supports voice customization for tasks ranging from interface voiceovers to voice assistants and automated media production. MAI-Image-2 creates visual content from text prompts, complementing the text and voice functions with a full visual module.
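To make the speed claims concrete, here is a minimal arithmetic sketch. It assumes the quoted figures hold uniformly (about 60 seconds of audio per second of compute for MAI-Voice-1, and a flat 2.5x speedup for MAI-Transcribe-1 over the baseline); the function names are hypothetical and real throughput will vary with load and input.

```python
def synthesis_seconds(audio_minutes: float,
                      audio_seconds_per_wall_second: float = 60.0) -> float:
    """Wall-clock seconds needed to synthesize `audio_minutes` of speech.

    The default rate reflects the article's "one minute of audio in about
    one second" figure and is an assumption, not a measured value.
    """
    return audio_minutes * 60.0 / audio_seconds_per_wall_second


def sped_up_duration(baseline_seconds: float, speedup: float = 2.5) -> float:
    """Processing time after applying a constant speedup factor."""
    return baseline_seconds / speedup


if __name__ == "__main__":
    # One hour of narration at the quoted synthesis rate.
    print(synthesis_seconds(60))
    # A transcription job that takes 100 s on the baseline service.
    print(sped_up_duration(100))
```

Under these assumptions, an hour of narration would take about a minute to synthesize, and a 100-second baseline transcription job would take about 40 seconds.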

The launch of three models at once shows that Microsoft is betting not on individual demonstration products but on its own multimodal infrastructure. Within the company, the work is led by the MAI Superintelligence team, which researches advanced AI systems. The division is headed by Mustafa Suleyman, who joined Microsoft to strengthen its AI efforts and build a more independent technology stack.

The logic is clear: a company with its own models for text, voice, and images gains more control over quality, speed, cost, and the pace of product development. For a corporation of this size, it is also a matter of negotiating position: the less it depends on an external model supplier, the more flexibly it can shape its product and cloud strategies.

Microsoft has put particular emphasis on cost of use. The company is trying to compete not only on quality but also on compute economics against alternatives from Google and OpenAI. Transcription pricing starts at $0.36 per hour. Speech synthesis is priced at $22 per million characters. Image generation is quoted at $5 per million input tokens and $33 per million output tokens. This matters most to businesses that weigh not only model capabilities but also the cost of each scenario, from call processing to automated media creation.
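The quoted prices can be turned into a rough per-scenario estimate. The sketch below is illustrative only: the `Workload` type and `estimate_cost_usd` function are hypothetical, the rates are copied from the figures above, and the transcription rate is a "starts from" price, so the result should be read as a lower bound rather than a bill.

```python
from dataclasses import dataclass

# Rates as quoted in the article, in USD. Actual billing may differ.
TRANSCRIBE_USD_PER_HOUR = 0.36  # "starts from" price, so a lower bound
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0


@dataclass
class Workload:
    """Hypothetical monthly usage for one product scenario."""
    audio_hours: float = 0.0
    tts_characters: int = 0
    image_input_tokens: int = 0
    image_output_tokens: int = 0


def estimate_cost_usd(w: Workload) -> dict[str, float]:
    """Return a per-model cost breakdown and the total, rounded to cents."""
    transcription = w.audio_hours * TRANSCRIBE_USD_PER_HOUR
    voice = w.tts_characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS
    image = (
        w.image_input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + w.image_output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )
    return {
        "transcription": round(transcription, 2),
        "voice": round(voice, 2),
        "image": round(image, 2),
        "total": round(transcription + voice + image, 2),
    }


if __name__ == "__main__":
    # Example: a call center with 1,000 hours of recordings, 2M characters
    # of synthesized speech, and 1M input plus 1M output image tokens.
    monthly = Workload(
        audio_hours=1000,
        tts_characters=2_000_000,
        image_input_tokens=1_000_000,
        image_output_tokens=1_000_000,
    )
    print(estimate_cost_usd(monthly))
```

A breakdown by model, rather than a single total, makes it easier to see which scenario dominates spending when comparing against other providers' price lists.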

If the stated metrics hold up in practice, Microsoft will be able to promote the new models as working tools for production-scale tasks, not just experiments. All three models are already deployed on the Microsoft Foundry platform, and the transcription and speech synthesis models are also available in MAI Playground. The company thus did not limit itself to a research announcement but immediately put the models in the hands of developers and corporate clients.

This move matters because the market is less and less interested in standalone lab demonstrations: value appears where a model can be quickly integrated into a product, tested on a real workload, and costed out. Foundry and Playground cover exactly this path from announcement to implementation.

At the same time, Microsoft is not abandoning its partnership strategy. The company continues to collaborate with OpenAI and maintains its multi-year contract, having already invested more than $13 billion in its partner. In essence, Microsoft is building a diversified stack in which its own models complement partner models rather than instantly replacing them. The approach resembles hardware procurement, where critical components are bought from multiple suppliers to reduce risk and avoid dependence on a single technology line.

The main conclusion is that Microsoft is reorienting its AI strategy toward greater autonomy. The company remains one of OpenAI's main allies, but it is now developing its own models and the infrastructure around them noticeably more actively. For the market, this is a signal that competition between major AI players will be fought not only over generation quality but also over speed, cost, and depth of integration into workflows. For Microsoft clients, it likely means a wider selection of tools within a single ecosystem and less dependence on a single model supplier.

Hamidun News

AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
