
Google Introduces Gemini 3.1 Flash TTS: A Speech Model with Voice Control, Dialogue, and 70+ Languages

Google released Gemini 3.1 Flash TTS — a new TTS model in preview with emphasis on naturalness and control. It supports 70+ languages, natively generates…

AI-processed from MarkTechPost; edited by Hamidun News
Source: MarkTechPost. Collage: Hamidun News.

Google has launched Gemini 3.1 Flash TTS in preview. The new speech synthesis model goes beyond reading text aloud and focuses on directed voice performance. The key difference in this release is that developers can set intonation, pace, accent, and even emotional shifts directly in the text prompt, rather than tuning a fixed set of parameters and hoping for the right result.

For the voice AI market, this is a notable shift: text-to-speech looks less like a black box and more like a director's tool. The preview is rolling out to developers through the Gemini API and Google AI Studio, to enterprise customers through Vertex AI, and to Workspace users through Google Vids.

According to Google, Gemini 3.1 Flash TTS scored 1211 Elo points in the Artificial Analysis TTS ranking, which is based on blind user comparisons of speech quality. The company calls the model the most natural and expressive in its TTS lineup. Google also emphasizes the combination of high quality and relatively low cost, meaning the model targets not just demos but also mass-market products.
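To give the Elo figure some intuition, the sketch below applies the standard Elo expected-score formula to the 1211 rating. The article does not describe how Artificial Analysis computes its ratings, so the 400-point scale is an assumption, and the rival rating of 1111 is purely hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo model.

    The 400-point scale is the classic Elo convention and is assumed here;
    the ranking cited in the article may use different parameters.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# 1211 is the score reported in the article; 1111 is a made-up rival rating.
win_rate = elo_expected_score(1211, 1111)
print(f"Expected preference rate against a 1111-rated model: {win_rate:.2f}")
```

Under these assumptions, a 100-point lead corresponds to being preferred in roughly 64% of blind comparisons, and equal ratings give exactly 50%.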

The key feature of the update is audio tags: commands embedded in the text that control exactly how a phrase is spoken. Developers can describe a scene, assign a voice profile to a character, add director's notes on tone and pace, and then refine individual lines, or even parts of a single line, with inline tags in square brackets. In other words, the same phrase can sound calm, irritated, whispered, or sped up without switching to a different pipeline. Google has added configurable controls for this in Google AI Studio, and finished settings can be exported as Gemini API code to keep the sound consistent across projects and platforms.
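As a rough illustration of the layering described above (scene, voice profile, director's notes, then inline bracketed tags), the sketch below assembles such a prompt as plain text. The function names, the section labels, and the tag words are hypothetical; the actual tag vocabulary and prompt layout are defined by Google's documentation, not by this example.

```python
def tag(name: str, text: str) -> str:
    """Prefix a span of text with an inline audio tag in square brackets."""
    return f"[{name}] {text}"


def build_tts_prompt(scene: str, profile: str, notes: str, lines: list[str]) -> str:
    """Assemble one text prompt: context first, then the lines to be spoken."""
    header = [
        f"Scene: {scene}",
        f"Voice profile: {profile}",
        f"Director's notes: {notes}",
    ]
    return "\n".join(header + ["", "Transcript:"] + lines)


# Two tags inside one line: only the second half is marked as whispered.
mixed_line = (
    tag("irritated", "The phones are down again,")
    + " "
    + tag("whispers", "but don't tell the producer.")
)

prompt = build_tts_prompt(
    scene="A late-night radio studio",
    profile="Warm, low-pitched host",
    notes="Slow pace, relaxed tone",
    lines=[tag("calm", "Welcome back to the show."), mixed_line],
)
print(prompt)
```

Keeping the prompt assembly in one small function makes it easier to reuse the same scene and voice profile across requests, which is the consistency benefit the article attributes to exported settings.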

The second major focus is global reach. Gemini 3.1 Flash TTS supports more than 70 languages and aims to capture local speech characteristics, not just convert text to audio: accents, dialect nuances, and delivery pace. For product teams, this matters most in localization, dubbing, voice assistants, podcasts, educational videos, and audiobooks.

Another notable difference is the native multi-speaker mode. The model can generate a dialogue between two speakers in a single request, without splitting the conversation into separate API calls. This should produce a more natural rhythm and more consistent delivery than the classic approach, in which each voice is synthesized separately and then stitched together on the application side.
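The single-request idea can be sketched on the application side as follows: the whole conversation is formatted into one transcript instead of one call per line. The `Name: text` layout and the `format_dialogue` helper are assumptions for illustration only; the two-speaker limit reflects what the article reports.

```python
def format_dialogue(turns: list[tuple[str, str]], max_speakers: int = 2) -> str:
    """Turn (speaker, text) pairs into one transcript for a single request."""
    speakers: list[str] = []
    for name, _ in turns:
        if name not in speakers:
            speakers.append(name)
    if len(speakers) > max_speakers:
        raise ValueError(
            f"{len(speakers)} speakers given, at most {max_speakers} supported"
        )
    return "\n".join(f"{name}: {text}" for name, text in turns)


transcript = format_dialogue(
    [
        ("Host", "So what changed in this release?"),
        ("Guest", "Mostly how much control you get over delivery."),
        ("Host", "Give me an example."),
    ]
)
print(transcript)
```

Validating the speaker count before sending anything keeps an unsupported request from reaching the service at all.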

Google also embeds SynthID watermarks in all generated audio. They should be imperceptible to the listener, but they make it possible to reliably identify a recording as AI-generated. As synthetic speech quality improves, this is no longer an optional extra but a basic security element: the more convincing the voice, the more important it is to be able to verify its origin automatically.

At the same time, the model is still in preview and has limitations. Google's documentation notes that this TTS model does not support streaming, that long outputs of more than a few minutes may lose stability and quality, and that in rare cases the service returns text tokens instead of audio, causing the request to fail with a 500 error. Prompts need care as well: if the instruction is vague, the model may reject the request or read the director's notes aloud as if they were part of the script.
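One way an application might work around these limitations is to split long text into shorter chunks and retry the occasional failed request. The sketch below is not tied to any SDK: `synthesize` is a hypothetical callable that returns audio bytes, `TransientTTSError` is a stand-in for a server-side 500 error, and the chunk size and retry counts are arbitrary.

```python
import re
import time
from typing import Callable


class TransientTTSError(Exception):
    """Stand-in for a server-side failure such as an HTTP 500 response."""


def split_text(text: str, max_chars: int = 1000) -> list[str]:
    """Group sentences into chunks of at most max_chars where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_with_retry(
    synthesize: Callable[[str], bytes],
    text: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call synthesize, retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            audio = synthesize(text)
            if audio:  # An empty result is treated like a failed attempt.
                return audio
        except TransientTTSError:
            pass
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"no audio after {attempts} attempts")
```

A single oversized sentence is kept whole rather than cut mid-sentence, so `max_chars` is a soft limit. Stitching chunks back together gives up some of the continuity of a single request, so this is a trade-off for long inputs rather than a default.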

The conclusion is simple: Google is trying to turn speech synthesis from a narrow API tool into part of a full multimodal Gemini platform. Gemini 3.1 Flash TTS is interesting not only because it sounds better than previous versions, but also because it gives developers a clearer and more controllable interface for working with voice. If the company quickly stabilizes long generations and keeps the balance of price and quality, it has a good chance of establishing itself not only in the infrastructure layer but also in creative voice products, where specialized TTS services have dominated so far.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
