Hugging Face Blog→ original

Hugging Face and Cerebras launch Gemma 4 for real-time voice AI

On July 1, 2026, Hugging Face and Cerebras introduced an open voice pipeline based on Google DeepMind’s Gemma 4 (31 billion parameters). The pipeline has…

AI-processed from Hugging Face Blog; edited by Hamidun News
Hugging Face and Cerebras launch Gemma 4 for real-time voice AI
Source: Hugging Face Blog. Collage: Hamidun News.
◐ Listen to article

Hugging Face and Cerebras launched an open speech-to-speech pipeline with predictable latency on July 1, 2026, built on Google DeepMind's Gemma 4 language model with 31 billion parameters. This is the first publicly available modular stack for voice AI in which developers prioritize latency stability equally with response quality.

System Architecture

The architecture consists of four independent components, each of which can be replaced without reworking the others:

  • Speech recognition — Nvidia Parakeet
  • Language model — Gemma 4 from Google DeepMind (31 billion parameters)
  • Inference platform — Cerebras
  • Speech synthesis — Qwen3TTS from Alibaba

This approach is fundamentally different from monolithic voice pipelines: when a more accurate ASR model or faster TTS engine is released, it can be swapped into the pipeline without stopping the entire system. This is especially important in the rapidly evolving field of open voice models.

For developers, an interactive demo is available in Hugging Face Space "HF Realtime Voice" and full source code in the huggingface/speech-to-speech repository on GitHub. Any of the four layers can be forked and adapted for specific tasks — from robotic assistants to corporate call centers.

The partnership between Hugging Face and Cerebras is part of a broader trend: inference speed has become as much a competitive advantage as the quality of the base model. For the open-source ecosystem, this means that low latency is no longer an exclusive privilege of closed APIs.

Why P95 Latency Matters

Median latency has long ceased to be a measure of quality: most commercial voice systems fit within acceptable 300–500 ms on average. The real problem is the 95th percentile (P95): that's where multi-second pauses appear that users perceive as the interlocutor "hanging."

The situation is exacerbated in multi-turn dialogues — when models need to call external tools, process images, or stitch together multiple context fragments. Each additional step multiplies the latency, and P95 becomes the Achilles heel of the architecture. Cerebras accelerates Gemma 4 inference so much that tail latencies become predictable — the system can be built with strict response guarantees.

The scale of real-world deployment reinforces this: more than 9,000 Reachy Mini robots are already operating in production on the speech-to-speech pipeline from Hugging Face. It is precisely such industrial deployments that expose the gap between lab benchmarks and real operational latency performance.

What This Means

The open stack on Gemma 4 with Cerebras inference lowers the barrier to entry for teams that need voice AI without proprietary dependencies. Modularity preserves long-term flexibility: each of the four layers is updated independently as better models are released — no need to rewrite the entire pipeline for a single improvement. The public demo and open repository turn the concept into a battle-tested template for developers of robotics, smart devices, and voice interfaces.

Frequently Asked Questions

How many parameters does Gemma 4 have in this pipeline?

The Gemma 4 version from Google DeepMind with 31 billion parameters is used; inference runs on the Cerebras platform, which ensures predictable latency even at the 95th percentile of load.

Where can I try the system?

A demo is available in the Hugging Face Space "HF Realtime Voice," with full source code open in the huggingface/speech-to-speech repository on GitHub.

ZK
Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.

Want to stop reading about AI and start using it?

AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.

What do you think?
Loading comments…