
OpenAI’s WebSocket mode changes the game for voice AI


AI-processed from MarkTechPost; edited by Hamidun News
Source: MarkTechPost. Collage: Hamidun News.

Latency is the main enemy of any voice interface. A one-second pause between your phrase and an AI assistant’s reply breaks the feeling of a live conversation and turns the interaction into tedious waiting. OpenAI is tackling this problem head-on with a WebSocket mode for its Realtime API, a design that could change the architecture of voice AI applications.

To understand the scale of the change, it is worth looking at how voice AI agents have worked until now. The classic architecture resembled an assembly line with three separate stations. First, the user’s audio was sent to a speech recognition model (Speech-to-Text), which turned sound into text. Then that text was passed to a large language model such as GPT to generate a reply. Finally, the text response was sent to a speech synthesis system (Text-to-Speech), which read it aloud. Each of these handoffs meant a separate API request, a separate network connection, and a separate queue on the server.

Engineers have compared such a system to a Rube Goldberg machine: an overly complex mechanism for accomplishing what seems like a simple task. Because the three stages run strictly in sequence, their delays add up, so total latency could easily reach one and a half to two seconds, and more at peak load.
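The arithmetic behind that figure can be sketched in a few lines of Python. The per-stage numbers below are hypothetical, chosen only to land in the range the article mentions; the `Stage` type and the `cascade_latency_ms` helper are illustrative names, not part of any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    processing_ms: int  # time the model spends on the request
    network_ms: int     # connection setup plus round trip for this hop


# Hypothetical figures, used only to illustrate how the delays accumulate.
CASCADE = [
    Stage("speech-to-text", 300, 150),
    Stage("language model", 700, 150),
    Stage("text-to-speech", 250, 150),
]


def cascade_latency_ms(stages):
    """Stages run one after another, so their delays simply add up."""
    return sum(s.processing_ms + s.network_ms for s in stages)


total = cascade_latency_ms(CASCADE)
network = sum(s.network_ms for s in CASCADE)
print(f"time to first audio: {total} ms, of which network: {network} ms")
```

With these assumed values the network hops alone account for roughly a quarter of the wait, which is the part a single persistent connection is meant to remove.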

OpenAI’s WebSocket mode offers a fundamentally different approach. Instead of three sequential HTTP requests, the client establishes one persistent WebSocket connection with the server, and audio travels over it in both directions as a continuous stream. As soon as the user starts speaking, audio data is on its way to the server. As soon as the model starts generating a reply, synthesized speech flows back to the client, even if generation is not yet complete. This is not just an optimization of the existing pipeline but its full replacement with a single multimodal model that takes audio as input and returns audio as output, bypassing intermediate text representations.
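A minimal sketch of what the client side of such a stream might look like: microphone audio is cut into small chunks, and each chunk is wrapped in a JSON message that would be sent over the open connection. The event name `input_audio_buffer.append`, the base64 `audio` field, and the 24 kHz 16-bit mono PCM format are assumptions based on OpenAI's public Realtime API documentation and should be checked against the current reference. The helper names are hypothetical, and no network connection is opened here.

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000  # assumed input format: 16-bit mono PCM at 24 kHz
BYTES_PER_SAMPLE = 2


def chunk_pcm(pcm: bytes, chunk_ms: int = 100):
    """Yield fixed-duration slices of raw PCM audio."""
    size = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def append_event(chunk: bytes) -> str:
    """Wrap one audio chunk in a JSON message (event name is an assumption)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })


# One second of silence stands in for microphone input.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
events = [append_event(chunk) for chunk in chunk_pcm(one_second)]
print(len(events), "messages for one second of audio")
```

In a real client each message would be written to the socket as soon as its chunk is captured, rather than collected in a list, so the server starts receiving speech while the user is still talking.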

Technically, this became possible thanks to several factors. First, OpenAI’s models themselves have become natively multimodal: GPT-4o and its successors can work with audio directly, without intermediate transcription. Second, the WebSocket protocol, unlike classic HTTP request-response, supports full-duplex communication: data can be transmitted in both directions simultaneously, which suits natural dialogue, where either side may speak at any moment. Third, streaming generation makes it possible to start playing back a reply before the model has finished generating it, just as a person begins hearing their conversation partner from the first syllable rather than waiting for the entire sentence to be completed.
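The second and third points can be illustrated with a small simulation. The sketch below replaces the network with two in-memory `asyncio.Queue` objects, one per direction, and runs sending and playback as independent concurrent tasks. `fake_model`, `send_mic`, and `play_replies` are hypothetical stand-ins, not real API calls.

```python
import asyncio


async def fake_model(uplink, downlink, log):
    """Stand-in for the remote model: reads the user turn, then streams a reply."""
    while (chunk := await uplink.get()) is not None:
        log.append(f"server got {chunk}")
    for i in range(3):
        await asyncio.sleep(0)  # pretend each reply chunk takes time to generate
        log.append(f"generated reply-{i}")
        await downlink.put(f"reply-{i}")
    await downlink.put(None)  # end of reply


async def send_mic(uplink, log):
    """Push microphone chunks upstream as they are captured."""
    for i in range(3):
        log.append(f"sent mic-{i}")
        await uplink.put(f"mic-{i}")
        await asyncio.sleep(0)
    await uplink.put(None)  # end of user turn


async def play_replies(downlink, log):
    """Play each reply chunk the moment it arrives."""
    while (chunk := await downlink.get()) is not None:
        log.append(f"played {chunk}")


async def main():
    uplink, downlink, log = asyncio.Queue(), asyncio.Queue(), []
    await asyncio.gather(
        fake_model(uplink, downlink, log),
        send_mic(uplink, log),
        play_replies(downlink, log),
    )
    return log


log = asyncio.run(main())
print("\n".join(log))
```

In the printed log, the first reply chunk is played before the last one has been generated, which is the streaming behaviour described above. A cascaded pipeline would instead wait for the full reply before synthesis could begin.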

The consequences for the industry could be significant. Latency is one of the main reasons voice interfaces have so far remained a niche product: Siri, Alexa, and Google Assistant all suffer from noticeable pauses that make conversation feel unnatural. Reducing latency to a level close to real time opens the door to new scenarios: telemedicine with an AI assistant that responds instantly to a patient’s words, educational applications in which an AI tutor conducts a live dialogue without irritating pauses, game NPCs that answer as quickly as a live actor, and corporate call centers in which an AI operator is indistinguishable from a human in terms of reaction speed.

Still, there is a downside. A persistent WebSocket connection holds server resources for the whole session rather than for the duration of a single request, which means the cost for developers may turn out to be higher. In addition, dependence on a single provider, OpenAI, grows stronger: whereas before it was possible to combine the best STT, LLM, and TTS from different companies, now the entire stack is locked into a single ecosystem. This is the classic trade-off between convenience and flexibility, and not every team will choose the former.
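One way a team might keep its options open is to hide the choice behind a small interface. The sketch below is a hypothetical design, not something the article or OpenAI prescribes: `VoiceBackend`, `CascadeBackend`, and `SpeechToSpeechBackend` are illustrative names, and the lambdas are toy stand-ins for real models.

```python
from typing import Callable, Protocol


class VoiceBackend(Protocol):
    """Anything that turns a user's audio into reply audio."""

    def respond(self, audio: bytes) -> bytes: ...


class CascadeBackend:
    """Three stages that can each come from a different vendor."""

    def __init__(
        self,
        stt: Callable[[bytes], str],
        llm: Callable[[str], str],
        tts: Callable[[str], bytes],
    ) -> None:
        self.stt, self.llm, self.tts = stt, llm, tts

    def respond(self, audio: bytes) -> bytes:
        return self.tts(self.llm(self.stt(audio)))


class SpeechToSpeechBackend:
    """One audio-in, audio-out model: nothing inside it can be swapped."""

    def __init__(self, model: Callable[[bytes], bytes]) -> None:
        self.model = model

    def respond(self, audio: bytes) -> bytes:
        return self.model(audio)


def handle_turn(backend: VoiceBackend, audio: bytes) -> bytes:
    """Application code depends only on the interface, not on a vendor."""
    return backend.respond(audio)


# Toy stand-ins so the sketch runs without any network access.
cascade = CascadeBackend(
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: text.upper(),
    tts=lambda text: text.encode("utf-8"),
)
single = SpeechToSpeechBackend(model=lambda audio: audio[::-1])

print(handle_turn(cascade, b"hello"))
print(handle_turn(single, b"hello"))
```

The cascade keeps three replaceable parts at the price of three hops; the single model removes the hops and the choice along with them. An interface like this does not eliminate lock-in, but it limits how much application code has to change if a team later switches approach.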

It is also worth noting the competitive context. Google is developing real-time multimodal capabilities of its own in Gemini. ElevenLabs and other speech synthesis startups are also working on reducing latency. But OpenAI has a strategic advantage: the company controls both the language model and the delivery infrastructure, which makes it possible to optimize the entire data path from the user’s microphone to the speaker.

OpenAI’s WebSocket mode is not just a technical API update. It is a signal that the era of text chatbots is gradually giving way to the era of voice AI agents, and that latency, the main barrier on this path, is beginning to crumble. The question now is not whether truly natural voice AI interfaces will appear, but how quickly they will become a normal part of everyday life.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
