
AWS SageMaker and vLLM: real-time streaming speech transcription


Source: AWS Machine Learning Blog. Collage: Hamidun News.

Voice agents, automatic subtitle systems, and contact center analytics all depend on one thing: instant, real-time speech transcription. AWS has presented an architecture in which an audio stream is transcribed as it arrives over a single persistent connection, without waiting for the recording to end.

Why the Old Way Broke

The traditional approach is request-response. The user sends the complete audio, the system receives it in full, and only then begins transcribing. The result arrives after that.

For asynchronous scenarios, such as processing an hour-long meeting recording, this is fine. But for voice agents that must respond in real time, this architecture ruins the interaction. A user says "Book me a table for eight" and waits for the agent's response, while the system is still collecting audio, waiting for a pause, and making sure the user has finished.

The result is a 2-3 second delay that breaks the feeling of conversation. Live captions in video broadcasts suffer the same way: request-response latency desynchronizes the text from the video, so captions lag behind speech by several seconds. For contact centers, analytics trail the conversation, and operator guidance arrives too late to help.
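The arithmetic behind that delay can be sketched in a few lines. This is a rough model rather than a measurement: the function names and every number below (utterance length, end-of-speech wait, transcription time, chunk size) are illustrative assumptions chosen to land in the 2-3 second range described above.

```python
def batch_time_to_first_text(utterance_s: float, endpoint_wait_s: float,
                             transcribe_s: float) -> float:
    """Request-response: no text comes back until the whole utterance is
    recorded, the end-of-speech pause has elapsed, and the clip is transcribed."""
    return utterance_s + endpoint_wait_s + transcribe_s


def streaming_time_to_first_text(chunk_s: float, per_chunk_infer_s: float) -> float:
    """Streaming: the first text can appear once one chunk is captured and processed."""
    return chunk_s + per_chunk_infer_s


# Illustrative numbers only (assumptions, not benchmarks).
utterance_s = 2.0
batch = batch_time_to_first_text(utterance_s, endpoint_wait_s=0.7, transcribe_s=1.5)
stream = streaming_time_to_first_text(chunk_s=0.1, per_chunk_infer_s=0.05)

# Silence the user sits through after they stop speaking (request-response case).
batch_lag_after_speech = batch - utterance_s

print(f"request-response: first text after {batch:.2f} s "
      f"({batch_lag_after_speech:.2f} s after the user stops talking)")
print(f"streaming:        first text after {stream:.2f} s")
```

In the request-response case the waiting time grows with the length of the utterance, while in the streaming case the time to first text depends only on the chunk size and per-chunk processing time.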

The Solution: Stream Processing on SageMaker AI

Amazon SageMaker AI paired with the vLLM inference framework offers an architecture that reframes the problem. Audio arrives in small chunks, and the model begins converting them to text as they arrive. The connection stays open, and results flow back in real time.
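As a minimal sketch of what "small chunks" could mean on the client side, the snippet below slices raw audio into fixed-duration pieces. The format (16 kHz, 16-bit mono PCM), the 100 ms chunk size, and the helper name `iter_pcm_chunks` are assumptions for illustration; the article does not specify them.

```python
from typing import Iterator


def iter_pcm_chunks(pcm: bytes, sample_rate: int = 16_000, chunk_ms: int = 100,
                    bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield fixed-duration slices of mono PCM audio; the last slice may be shorter."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# One second of silence at 16 kHz, 16-bit mono is 32,000 bytes.
one_second = bytes(16_000 * 2)
chunks = list(iter_pcm_chunks(one_second))
print(len(chunks), "chunks of", len(chunks[0]), "bytes")
```

Each yielded slice would be sent over the open connection as soon as it is available, instead of waiting for the full recording.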

There is no need to wait for the end of the recording. It works like streaming video, where the first frames play while the rest are still loading. Each audio chunk is processed while the next one is being received, so the inference pipeline runs continuously, buffering incoming chunks.
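The overlap between receiving and processing can be illustrated with a small producer/consumer pipeline. This is a local simulation using only the standard library, not the SageMaker or vLLM API: `receive_audio` stands in for the open connection, `fake_transcribe` is a hypothetical placeholder for model inference, and the sleep durations are arbitrary.

```python
import asyncio


async def receive_audio(queue: asyncio.Queue, chunks: list[bytes]) -> None:
    """Producer: stands in for audio arriving over the open connection."""
    for chunk in chunks:
        await asyncio.sleep(0.01)   # simulated capture/network interval
        await queue.put(chunk)
    await queue.put(None)           # sentinel: the stream has finished


async def fake_transcribe(chunk: bytes) -> str:
    """Hypothetical stand-in for model inference on a single chunk."""
    await asyncio.sleep(0.005)
    return f"<{len(chunk)} bytes>"


async def transcribe_stream(queue: asyncio.Queue, results: list[str]) -> None:
    """Consumer: produces text per chunk while the producer keeps receiving."""
    while (chunk := await queue.get()) is not None:
        results.append(await fake_transcribe(chunk))


async def run_pipeline(chunks: list[bytes]) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)  # bounded buffer of pending chunks
    results: list[str] = []
    # Both coroutines run concurrently: receiving chunk N+1 overlaps
    # with transcribing chunk N.
    await asyncio.gather(receive_audio(queue, chunks),
                         transcribe_stream(queue, results))
    return results


partials = asyncio.run(run_pipeline([b"\x00" * 3200] * 5))
print(partials)
```

The bounded queue is one possible design choice: it buffers a few chunks when inference briefly falls behind, and applies backpressure to the receiver if the backlog grows.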

vLLM is critical here: it is optimized for this kind of streaming inference. The framework redistributes computation so the processor doesn't wait for all input to arrive. The reported result is latency in milliseconds instead of seconds, with memory requirements per request reduced by 30-50 percent.

"Stream processing changes the physics: instead of one large request, many small but connected ones. This distributes computations and keeps latency in the acceptable range."

Where It's Applied

Use cases are numerous:

  • Voice agents and chatbots respond without 2-3 second pauses; the agent hears the first phrase and is already generating a response
  • Live captioning — captions appear almost synchronously with speech, ideal for broadcasts and webinars
  • Contact center analytics — the system analyzes speech as the conversation unfolds, suggests answers to the operator in real time
  • Accessibility tools — applications for hard-of-hearing users deliver text instantly, without delay
  • Automotive interfaces — the voice assistant responds as quickly as the text one

AWS provides this as a managed service through SageMaker, so a company doesn't need to deploy GPU clusters itself, tune vLLM for its own hardware, or scale infrastructure during traffic spikes. Pricing is pay-as-you-go.

What It Means

Streaming speech processing is moving from research projects to a production standard. For businesses, this lowers the cost of entry into voice interfaces by an order of magnitude: previously you needed your own infrastructure, and now it's an API call. For users, voice input reaches parity with text: it is responsive, natural, and doesn't require waiting. In the coming years, this will become the baseline expectation for any AI application that works with speech.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.