Loka built a voice agent on Amazon Nova 2 Sonic with sub-second latency
AI-processed from AWS Machine Learning Blog; edited by Hamidun News
Loka published a detailed breakdown of the architecture it used to create a voice agent based on Amazon Nova 2 Sonic — AWS's next-generation speech model. The challenge was straightforward: build a bot that customers won't hang up on after waiting a few seconds.
The Problem Being Solved
A robotic voice in phone bots is not just an aesthetic irritation. For businesses, it means direct losses: the customer hangs up, calls back to speak to a live operator, or switches to a competitor. Brand reputation suffers and support costs rise.
Classical voice systems run a multi-stage chain: speech recognition (ASR) converts the caller's audio to text, a language model generates an answer, and speech synthesis (TTS) turns that answer back into audio. Latency accumulates at every step, so the pause between the customer's question and the bot's answer typically runs 2 to 5 seconds.
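A rough way to see how the delay builds up is to add the stages together, since each one waits for the previous one to finish. The sketch below does only that arithmetic. The per-stage figures are hypothetical placeholders chosen to land inside the 2 to 5 second range quoted above; they are not measurements from Loka or AWS.

```python
# Hypothetical per-stage latencies in milliseconds.
# Illustrative placeholders only, not measured values from the source.
CASCADE_STAGES_MS = {
    "asr_transcription": 600,
    "llm_generation": 1200,
    "tts_synthesis": 700,
    "network_hops": 300,
}


def total_latency_ms(stages: dict[str, int]) -> int:
    """Sequential stages add up: the caller waits for the sum of all of them."""
    return sum(stages.values())


if __name__ == "__main__":
    total = total_latency_ms(CASCADE_STAGES_MS)
    print(f"cascaded pipeline: {total} ms before the caller hears anything")
```

With these assumed numbers the total is 2800 ms, which shows why trimming any single stage is rarely enough to bring a cascaded pipeline under one second.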
In that time, a caller decides the system isn't working and either hangs up or demands a live operator. Loka set out to break this chain and build an agent that responds within the natural pause of conversation, as a human agent would. Its answer was Amazon Nova 2 Sonic.
What Nova 2 Sonic Does Differently
Nova 2 Sonic is a multimodal speech-to-speech model from AWS that works directly with audio, with no separate ASR and TTS steps. It takes an audio stream as input and generates an audio stream as output without an intermediate text-conversion stage. This changes both the latency profile and the feel of the conversation:
- Responses begin within 300–500 ms after the user pauses
- The model understands natural interruptions in speech and responds correctly to them
- The model picks up on intonation and emotional context and adapts the response tone accordingly
- The feeling of "the system is processing" completely disappears from the dialogue
- Integration with business logic through function calling doesn't interrupt the conversation flow
Nova 2 Sonic is available through Amazon Bedrock, allowing companies on AWS to integrate it without switching providers or completely rebuilding their infrastructure.
Production Architecture
Loka implemented real-time audio streaming with minimal buffering. The system doesn't wait for the user to finish speaking. It begins processing immediately, which lets Nova 2 Sonic respond at the moment of a natural pause rather than after prolonged silence.
"Robotic voice is the main reason customers hang up. It's not a technical problem; it's a trust problem," notes the Loka team.
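The minimal-buffering idea described above can be illustrated with a small sketch: instead of collecting a whole utterance, the client cuts incoming audio into short fixed-size frames and forwards each one as soon as it exists. The sample rate, the 20 ms frame length, and the `send` callback are assumptions made for this example; the source does not say which values or transport Loka uses.

```python
from typing import Callable, Iterator

SAMPLE_RATE_HZ = 16_000  # assumed input rate, not stated in the source
BYTES_PER_SAMPLE = 2     # 16-bit mono PCM (assumption)
FRAME_MS = 20            # assumed frame duration

# Bytes in one frame: 16000 samples/s * 2 bytes * 0.020 s = 640
FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS // 1000


def iter_frames(pcm: bytes, frame_bytes: int = FRAME_BYTES) -> Iterator[bytes]:
    """Yield consecutive fixed-size frames; the last frame may be shorter."""
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


def stream_audio(pcm: bytes, send: Callable[[bytes], None]) -> int:
    """Forward each frame immediately via `send` (a hypothetical transport
    callback) and return the number of frames sent."""
    count = 0
    for frame in iter_frames(pcm):
        send(frame)
        count += 1
    return count


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    sent: list[bytes] = []
    print(stream_audio(one_second_of_silence, sent.append), "frames sent")
```

Smaller frames reduce the time audio sits in a buffer before the model sees it, at the cost of more messages per second. The right trade-off depends on the transport and is not specified in the source.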
To reach business data such as order status, customer history, and stock availability, the agent uses function calling during the live conversation. For the customer, this looks like an instant response rather than a noticeable pause while waiting for results. In production, the system demonstrates resilience to interruptions, topic switches, and non-standard pauses, which are the scenarios where classical ASR-based pipelines most often fail.
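On the application side, function calling usually comes down to a dispatcher: the model names a tool and supplies JSON arguments, the application runs the matching function, and the result goes back to the model as JSON. The sketch below shows that pattern in isolation. The tool name `get_order_status`, the in-memory order table, and the `handle_tool_request` helper are all hypothetical, and the sketch does not reproduce the actual Nova 2 Sonic event format.

```python
import json
from typing import Any, Callable

# Hypothetical in-memory data standing in for a real order system.
_ORDERS = {"A100": "shipped", "A101": "processing"}


def get_order_status(order_id: str) -> dict[str, Any]:
    """Look up an order; report 'found': False instead of raising."""
    status = _ORDERS.get(order_id)
    if status is None:
        return {"order_id": order_id, "found": False}
    return {"order_id": order_id, "found": True, "status": status}


# Registry mapping tool names (as the model would request them) to functions.
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "get_order_status": get_order_status,
}


def handle_tool_request(name: str, arguments_json: str) -> str:
    """Run the requested tool and return a JSON string for the model.
    Errors are returned as JSON so the conversation can continue."""
    tool = TOOLS.get(name)
    if tool is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        return json.dumps(tool(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        return json.dumps({"error": str(exc)})


if __name__ == "__main__":
    print(handle_tool_request("get_order_status", '{"order_id": "A100"}'))
```

Returning errors as data rather than raising keeps a failed lookup from breaking the audio session, so the agent can tell the caller it could not find the order and keep talking.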
What This Means
Speech-to-speech models remove the main barrier to mass adoption of voice bots: the noticeable latency that destroys the illusion of live conversation. If latency is imperceptible and the voice sounds natural, the boundary between agent and operator blurs. For businesses, this is a direct path to call center automation without harming Net Promoter Score (NPS). Comparable models from other providers are likely to follow Nova 2 Sonic, and competition in the voice AI segment is only beginning.