xAI launches separate Grok APIs for speech recognition and synthesis for corporate developers

xAI has launched separate Grok APIs for speech recognition and synthesis, selling its voice stack as standalone infrastructure for the first time. STT supports more than 25 languages in streaming and batch modes, while TTS provides five voices and speech tags for intonation. xAI's pricing is aggressive: from $0.10 per hour for batch transcription and $4.20 per million characters for synthesis.

Khamidun Zhemal

AI monitoring · MarkTechPost

Apr 27, 2026· 3 min

AI-processed from MarkTechPost; edited by Hamidun News



xAI has released separate APIs for speech recognition and synthesis, transforming Grok's voice capabilities from an internal product into a standalone infrastructure service for developers. The two services in question are Speech-to-Text and Text-to-Speech, which operate on the same technological foundation already used in mobile Grok, Tesla vehicles, and Starlink support. For xAI, this is not simply another API feature, but a direct entry into the voice platforms market, where ElevenLabs, Deepgram, and AssemblyAI have already established themselves.

From a practical standpoint, xAI places its primary emphasis on enterprise scenarios. The Speech-to-Text API supports batch processing and real-time streaming transcription. According to xAI's documentation, batch mode costs $0.10 per hour of audio, while streaming mode costs $0.20. The service works with more than 25 languages and can not only convert speech to raw text but also structure the output: placing numbers, dates, currencies, and other elements in proper written form.
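As a rough illustration of what these rates mean in practice, the sketch below estimates a transcription bill from hours of audio. The rates are the ones quoted in the article, reading the $0.20 streaming figure as a per-hour price as well. The function name and the 10,000-hour workload are purely illustrative, and billing details such as rounding or minimum charges are not described in the source.

```python
# Per-hour STT rates as quoted in the article (USD per hour of audio).
STT_RATE_PER_HOUR = {"batch": 0.10, "streaming": 0.20}


def estimate_stt_cost(audio_hours: float, mode: str) -> float:
    """Return an estimated cost in USD for transcribing `audio_hours` of audio.

    Hypothetical helper: it multiplies hours by the quoted rate and ignores
    any rounding or minimum-charge rules the provider may apply.
    """
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    if mode not in STT_RATE_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    return round(audio_hours * STT_RATE_PER_HOUR[mode], 2)


if __name__ == "__main__":
    monthly_hours = 10_000  # illustrative call-center volume, not from the source
    for mode in STT_RATE_PER_HOUR:
        print(f"{mode}: ${estimate_stt_cost(monthly_hours, mode):,.2f}")
```

At these rates, 10,000 hours of audio a month would come to roughly $1,000 in batch mode and $2,000 in streaming mode.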

For teams building call centers, voice assistants, meeting transcription services, or telephone automation, this is more important than it may appear at first glance: after such normalization, the text is easier to index, analyze, and send downstream into LLM chains. The STT component also includes a set of features clearly designed for production workloads. xAI claims support for 12 audio formats, files up to 500 MB, word-level timestamps, diarization for speaker separation, and multichannel mode for channel-based recording.
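To make the combination of word-level timestamps and diarization more concrete, the sketch below merges a flat list of timestamped words into speaker turns, which is the form most downstream analytics and LLM prompts expect. The `Word` and `Turn` structures and their field names are assumptions made for illustration; the source does not describe the actual response schema of the xAI API.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical shape of one recognized word; not the real API schema.
    text: str
    start: float  # seconds from the beginning of the audio
    end: float
    speaker: str  # diarization label, e.g. "S1"


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: list[Word]) -> list[Turn]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: list[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w.speaker:
            # Same speaker keeps talking: extend the current turn.
            last = turns[-1]
            last.end = w.end
            last.text += " " + w.text
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(w.speaker, w.start, w.end, w.text))
    return turns


if __name__ == "__main__":
    sample = [
        Word("Hello", 0.0, 0.4, "S1"),
        Word("there", 0.4, 0.8, "S1"),
        Word("Hi", 1.0, 1.2, "S2"),
    ]
    for t in group_into_turns(sample):
        print(f"[{t.start:.1f}-{t.end:.1f}] {t.speaker}: {t.text}")
```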

In other words, the service is designed not only for simple voice notes but also for negotiations, podcasts, client calls, and complex multichannel recordings. xAI separately emphasizes the quality of entity recognition in phone conversations: names, dates, account numbers, and other sensitive details that conventional ASR systems often get wrong. The strongest part of the announcement is the price and quality positioning against competitors.

According to xAI's own benchmarks, Grok STT showed 5.0% error on the entity recognition task in phone conversations versus 12.0% for ElevenLabs, 13.5% for Deepgram, and 21.3% for AssemblyAI. On the general dataset, xAI reports a 6.9% word error rate. These figures should for now be understood as internal statements from the company itself, not an independent industry assessment, but even in this form the message is clear: xAI wants to sell not "another voice API" but a more accurate system for business communications where names, amounts, dates, and legal terminology are critical. The second service, Text-to-Speech, complements this strategy and is also presented as a tool for developers, not simply a demonstration voice effect.

xAI priced synthesis at $4.20 per million characters and opened access to it via a standard REST API and WebSocket for real-time generation. TTS includes five voices, support for 20 languages, and several output formats—from standard MP3 to PCM and telephone mu-law and A-law.
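For a sense of scale, character-based pricing can be worked out with one line of arithmetic. The sketch below uses the $4.20 rate quoted in the article and assumes every character of the input is billed, which the source does not confirm (for example, whether whitespace or speech tags count). The function name and the workload figures are illustrative.

```python
# TTS rate as quoted in the article: USD per one million characters.
TTS_RATE_PER_MILLION_CHARS = 4.20


def estimate_tts_cost(char_count: int) -> float:
    """Estimated synthesis cost in USD, assuming every character is billed."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    return char_count * TTS_RATE_PER_MILLION_CHARS / 1_000_000


if __name__ == "__main__":
    # Illustrative workload: 50,000 spoken replies of about 300 characters each.
    total_chars = 50_000 * 300
    print(f"{total_chars:,} characters: ${estimate_tts_cost(total_chars):.2f}")
```

Under these assumptions, 15 million characters a month would cost about $63.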

The key feature is speech tags: a developer can insert control markers into the text such as whisper, pause, laughter, accent, or rate slowdown. This makes the API suitable for voice agents, IVR scenarios, educational products, and media formats where dry "robotic" synthesis no longer satisfies the market. It is also important how xAI structures its voice lineup.

Previously, the company promoted Grok Voice and the voice agent API as a unified conversational interface. Now it sells STT and TTS separately, allowing companies to build their own stack: recognize the incoming audio stream separately, synthesize responses separately, and keep the LLM logic in-house or connect it through another service. For enterprise developers, this significantly lowers the integration barrier, since there is no need to adopt the entire xAI voice stack at once.
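One way to picture such a modular stack is as three interchangeable components behind small interfaces. The sketch below is an assumption-laden illustration, not xAI's SDK: `SpeechToText`, `TextToSpeech`, and `handle_turn` are hypothetical names, and the stand-in classes only fake recognition and synthesis so the example runs without network access.

```python
from typing import Callable, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


def handle_turn(
    audio: bytes,
    stt: SpeechToText,
    respond: Callable[[str], str],
    tts: TextToSpeech,
) -> bytes:
    """One conversational turn: audio in, text reasoning, audio out."""
    transcript = stt.transcribe(audio)  # any STT vendor can sit here
    reply = respond(transcript)         # in-house or third-party LLM logic
    return tts.synthesize(reply)        # any TTS vendor can sit here


# Stand-in implementations (hypothetical) so the sketch is runnable offline.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


if __name__ == "__main__":
    out = handle_turn(
        b"what are your opening hours",
        EchoSTT(),
        lambda question: f"You asked: {question}",
        BytesTTS(),
    )
    print(out.decode("utf-8"))
```

Because each stage depends only on a narrow interface, a team could swap the recognition or synthesis provider without touching the response logic.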

The conclusion is straightforward: xAI is attempting to occupy a position not only in the chatbot race but also in the more applied segment of voice infrastructure. If the claimed prices, latencies, and quality are confirmed in real-world deployments, the company has a chance to quickly enter enterprise use cases—from customer support to internal voice assistants. However, the market will ultimately judge not by the announcement but by API stability, transparency of limits, quality across different languages, and how well this system performs outside xAI's own demos and benchmarks.
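Teams that want to check claimed accuracy on their own recordings can compute word error rate themselves. The sketch below is a minimal version: word-level edit distance divided by the number of reference words, with only lowercasing and whitespace splitting as normalization. How xAI normalized text or scored its own benchmarks is not described in the source, so results from this function are not directly comparable to the published figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion (reference word missing)
                    curr[j - 1] + 1,     # insertion (extra hypothesis word)
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "pay forty dollars on may third"
    hypothesis = "pay fourteen dollars on may third"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

In the sample pair, one of six reference words is substituted, and it is exactly the kind of amount that matters in a business call.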

