Amazon Nova Sonic: Three Architectures for Voice Agents
AI-processed from AWS Machine Learning Blog; edited by Hamidun News
AWS has shared recommendations for building scalable voice agents using Amazon Nova Sonic, a model for processing natural speech in real-time scenarios, from customer service and technical support to appointment booking and personal assistants. The AWS blog breaks down three architectural patterns, ways to minimize latency, and practices for integrating multi-agent systems.
Amazon Nova Sonic: a model for dialogue
Amazon Nova Sonic is a compact yet powerful model for voice interaction, available through the Amazon Bedrock API. Unlike large foundation models, Sonic is optimized specifically for low-latency responses and real-time audio stream processing. It can work both directly with audio and with text transcription, depending on the architecture.
The key advantage is integration with tools and external APIs. An agent can not only answer a question but also invoke a function: check an order status, book a table at a restaurant, or get a weather forecast. All of this happens within one conversation, without switching between applications.
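To make the tool-calling idea concrete, here is a minimal sketch of the application side: the model asks for a tool by name, and the agent code runs the matching function and returns the result. The registry, the tool names, the stubbed return values, and the JSON request shape are all assumptions made for illustration; they are not the Nova Sonic event format.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tool registry; names and payloads are illustrative only.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {}


def tool(name: str):
    """Register a function under the name the model will use to request it."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("check_order_status")
def check_order_status(order_id: str) -> Dict[str, Any]:
    # Stub: a real agent would query an order system here.
    return {"order_id": order_id, "status": "shipped"}


@tool("get_weather")
def get_weather(city: str) -> Dict[str, Any]:
    # Stub: a real agent would call a weather API here.
    return {"city": city, "forecast": "sunny"}


def handle_tool_request(request_json: str) -> str:
    """Run the tool named in a (hypothetical) tool request and return JSON."""
    request = json.loads(request_json)
    name = request["name"]
    fn = TOOLS.get(name)
    if fn is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = fn(**request.get("arguments", {}))
    except TypeError as exc:
        # Wrong or missing arguments: report back instead of crashing the call.
        return json.dumps({"error": str(exc)})
    return json.dumps(result)
```

Returning errors as data rather than raising keeps the conversation alive, so the model can apologize or ask the user for the missing detail.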
Three architectural patterns
AWS describes three main approaches, each with different trade-offs between simplicity and functionality.
Single-turn agentless: the simplest pattern. The user says one phrase and the model responds. There is no state memory and no session management. It works well for FAQ bots and simple reference systems. It is fast and reliable, but not suitable for complex processes that require multiple steps.
Multi-turn with state: the agent remembers conversation context and can conduct a multi-step dialogue. For example, hotel booking: "What dates?" → "For how many people?" → "Do you have location preferences?". Here you need to manage the session, save dialogue variables, and track which steps have been completed. Bedrock AgentCore helps with this.
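The bookkeeping behind the hotel-booking example can be sketched as a small session object that tracks which questions have been answered. This is a hand-rolled illustration, not Bedrock AgentCore code; the class name `BookingSession` and the slot names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Ordered slots and their prompts, taken from the hotel-booking example.
SLOTS = [
    ("dates", "What dates?"),
    ("guests", "For how many people?"),
    ("location", "Do you have location preferences?"),
]


@dataclass
class BookingSession:
    """Hypothetical per-conversation state: one instance per session."""
    session_id: str
    answers: Dict[str, str] = field(default_factory=dict)

    def next_slot(self) -> Optional[str]:
        # The first slot without an answer is the current step.
        for name, _prompt in SLOTS:
            if name not in self.answers:
                return name
        return None

    def next_prompt(self) -> Optional[str]:
        slot = self.next_slot()
        return dict(SLOTS)[slot] if slot else None

    def record(self, user_text: str) -> None:
        # Store the user's reply against the step that was just asked.
        slot = self.next_slot()
        if slot is not None:
            self.answers[slot] = user_text
```

In a deployed agent this state would live in a session store keyed by `session_id`, so that any worker can pick up the next turn.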
Multi-agent orchestration: several specialized agents work together. For example, one agent handles tariff questions, another handles technical support, and a third handles billing. The main orchestrator decides which agent receives each request. Strands BidiAgent provides a clean bidirectional flow: it processes the live audio stream from the user rather than only synthesizing speech in response.
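The routing decision at the heart of this pattern can be illustrated with a deliberately naive keyword router. It does not use the Strands API; the agent functions and keyword lists are hypothetical, and a production orchestrator would more likely let a model classify intent.

```python
from typing import Callable, Dict, List


def tariff_agent(text: str) -> str:
    return f"[tariffs] handling: {text}"


def support_agent(text: str) -> str:
    return f"[support] handling: {text}"


def billing_agent(text: str) -> str:
    return f"[billing] handling: {text}"


# Hypothetical specialists and the keywords that select them.
AGENTS: Dict[str, Callable[[str], str]] = {
    "tariffs": tariff_agent,
    "support": support_agent,
    "billing": billing_agent,
}
KEYWORDS: Dict[str, List[str]] = {
    "tariffs": ["tariff", "plan", "price"],
    "billing": ["invoice", "charge", "refund"],
    "support": ["error", "broken", "not working"],
}


def route(text: str, default: str = "support") -> str:
    """Pick the first agent whose keyword appears in the transcript."""
    lowered = text.lower()
    for agent, keywords in KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return agent
    return default


def orchestrate(text: str) -> str:
    """Forward the request to the chosen specialist and return its reply."""
    return AGENTS[route(text)](text)
```

Substring matching is crude (for instance, "explanation" contains "plan"), which is one reason to treat this only as a picture of the control flow.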
Minimizing latency: practice
The main challenge for voice agents is response time. Users notice even a 100–200 ms delay between the end of their question and the start of the response. The brain interprets this as unnatural, and the agent begins to seem slow or frozen. AWS recommends several techniques:
- Streaming API instead of batch — don't wait for the full response from the model, send the first voice tokens immediately
- Tool call caching — repeated requests return the cached result
- Session segmentation — the system automatically determines boundaries of logical conversation blocks
- Edge deployment — place the model closer to the end user
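Two of the techniques above, streaming and tool call caching, can be sketched in a few lines. The chunk generator stands in for a streaming model response, and `TTLCache` is a hypothetical helper; neither is part of an AWS SDK, and the expiry window is an assumption.

```python
import time
from typing import Any, Callable, Dict, Hashable, Iterator, List, Tuple


def model_chunks(n: int, log: List[str]) -> Iterator[bytes]:
    """Stand-in for a streaming model response (hypothetical)."""
    for i in range(n):
        log.append(f"produced {i}")
        yield f"chunk-{i}".encode()


def play_streaming(chunks: Iterator[bytes], log: List[str]) -> None:
    """Hand each chunk to playback as it arrives, not after the full reply."""
    for i, _chunk in enumerate(chunks):
        log.append(f"played {i}")


class TTLCache:
    """Cache tool results for a limited time so repeated calls skip the backend."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_call(self, key: Hashable, fn: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: reuse the cached result
        value = fn()       # missing or expired: call the tool again
        self._store[key] = (now, value)
        return value
```

With streaming, the log interleaves "produced" and "played" entries, so playback of the first chunk starts before the last one exists. The cache takes an injectable clock so that expiry can be tested without sleeping; the right TTL depends on how quickly the underlying data changes.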
What this means
Voice interfaces are becoming a standard mode of interaction, from smart speakers to corporate call centers. Previously, companies had to assemble such systems from separate pieces. Now AWS provides an integrated stack: model, tools, and orchestration. If you're building a customer service bot or an AI assistant, the AWS guide is a practical starting point.