OpenAI explained how it rebuilt WebRTC for low-latency voice AI

OpenAI described how it rewrote the WebRTC stack for ChatGPT Voice and the Realtime API so conversations with AI stay smooth, without pauses or stutters…

Hamidun News Editorial

AI monitoring · OpenAI Blog

May 16, 2026 · 3 min

AI-processed from OpenAI Blog; edited by Hamidun News



OpenAI revealed details on how it redesigned its WebRTC infrastructure for ChatGPT's voice features and the Realtime API. The goal was straightforward: conversations with AI should not break down because of network latency, even under global load from hundreds of millions of users.

Why the old approach didn't work

For voice AI, it is not enough to simply recognize speech and quickly generate a response. Conversation must flow at the pace of human speech: without awkward pauses, without interruptions that get clipped, and without second-long delays before the model reacts.

OpenAI writes that at its scale, this comes down to three key requirements: global coverage for more than 900 million weekly active users, fast connection establishment, and low, stable round-trip latency for audio.

"Voice AI feels natural only when conversation moves at speech pace."

The problem was that the classical WebRTC approach doesn't mesh well with OpenAI's cloud infrastructure. If each session needs its own public UDP port, then with high concurrency you have to open and secure huge port ranges. This is inconvenient for Kubernetes, complicates load balancing, makes autoscaling more fragile, and increases the attack surface. Meanwhile, the ICE and DTLS sessions themselves remain stateful: packets need to reach exactly the process that owns the particular connection.
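The port-range problem is easy to see with rough arithmetic. The sketch below is illustrative only: the session count in the usage note is a hypothetical figure, not a number from OpenAI, and the helper names are invented for this example. It assumes one public UDP port per session and counts only unprivileged ports (1024 to 65535).

```python
import math

# Port bounds for the illustration; the session counts passed in are
# hypothetical and do not come from OpenAI.
FIRST_UNPRIVILEGED_PORT = 1024
LAST_PORT = 65535


def usable_ports_per_ip() -> int:
    """Number of unprivileged UDP ports available on one public address."""
    return LAST_PORT - FIRST_UNPRIVILEGED_PORT + 1


def public_ips_needed(concurrent_sessions: int) -> int:
    """Public addresses required if every session owns one UDP port."""
    return math.ceil(concurrent_sessions / usable_ports_per_ip())
```

Under these assumptions, a hypothetical one million concurrent sessions would need 16 public addresses with every unprivileged port open, which is the kind of sprawl the article describes.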

Relay plus transceiver

After comparing several options, OpenAI abandoned the scheme where the model acts as a regular WebRTC participant through an SFU. For their workload, one-to-one scenarios are typical: one user talks to one model or one client communicates with one real-time agent. So the company chose the transceiver model: an edge service terminates the client WebRTC connection, then translates media and events into internal protocols for inference, transcription, speech synthesis, tool calling, and orchestration.

The key idea of the new architecture is to separate packet routing from protocol termination. The relay became a lightweight UDP layer at the ingress with a small public network footprint, while the transceiver remained the stateful component that holds ICE state, DTLS state, SRTP keys, and the entire session lifecycle. The relay does not decrypt media, does not negotiate codecs, and does not pretend to be a WebRTC peer. It reads only the minimal metadata it needs from each packet and forwards the traffic to the transceiver that owns the session.
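A minimal sketch of that split, in Python for brevity rather than the Go that OpenAI used. The `Relay` class, its methods, and the flow table layout are assumptions made for this illustration. The first-byte ranges follow the standard demultiplexing scheme for STUN, DTLS, and RTP/RTCP sharing one port (RFC 7983); the article does not say whether OpenAI's relay classifies packets this way.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Addr = Tuple[str, int]


def classify(packet: bytes) -> str:
    """Classify a datagram by its first byte, per RFC 7983 ranges."""
    if not packet:
        return "unknown"
    first = packet[0]
    if first <= 3:
        return "stun"
    if 20 <= first <= 63:
        return "dtls"
    if 128 <= first <= 191:
        return "rtp"  # RTP or RTCP; both stay encrypted end to end here
    return "unknown"


@dataclass
class Relay:
    """Hypothetical relay: maps client addresses to transceiver addresses."""
    flows: Dict[Addr, Addr] = field(default_factory=dict)

    def bind(self, client: Addr, transceiver: Addr) -> None:
        self.flows[client] = transceiver

    def forward(self, client: Addr, packet: bytes) -> Optional[Tuple[Addr, bytes]]:
        """Return (destination, payload), or None if the packet is dropped."""
        if classify(packet) == "unknown":
            return None
        transceiver = self.flows.get(client)
        if transceiver is None:
            return None
        # The payload is passed through untouched: no decryption, no parsing
        # beyond the first byte. Session state lives in the transceiver.
        return transceiver, packet
```

The point of the sketch is what the relay does not hold: no keys, no codec state, only an address mapping.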

The most interesting trick involves the first packet. OpenAI uses the ICE username fragment (ufrag) and embeds just enough routing information in it for the relay to select the cluster and the specific transceiver. During signaling, the client receives a shared relay VIP and a fixed UDP port; the first STUN packet then carries enough data for the system to send the stream down the correct path right away, without a separate call to an external lookup service. After the route is established, the address mapping is also stored in Redis for quick recovery after a relay restart.
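The idea can be sketched as follows, in Python for brevity. The field layout (a 2-byte cluster ID, a 4-byte transceiver ID, and 3 random bytes) and all function names are assumptions for this illustration; the article does not describe OpenAI's actual encoding. What is standard is the rest: in an ICE connectivity check, the STUN USERNAME attribute has the form `receiver_ufrag:sender_ufrag`, so the server-chosen ufrag comes first in the client's first packet, and the base64 alphabet fits the characters ICE allows in a ufrag.

```python
import base64
import os
import struct
from typing import Optional, Tuple

STUN_MAGIC_COOKIE = 0x2112A442
STUN_BINDING_REQUEST = 0x0001
STUN_ATTR_USERNAME = 0x0006


def make_ufrag(cluster_id: int, transceiver_id: int) -> str:
    """Pack hypothetical routing fields plus random bytes into a ufrag."""
    raw = struct.pack("!HI", cluster_id, transceiver_id) + os.urandom(3)
    # 9 bytes encode to 12 base64 characters with no '=' padding.
    return base64.b64encode(raw).decode("ascii")


def route_from_ufrag(ufrag: str) -> Tuple[int, int]:
    raw = base64.b64decode(ufrag)
    return struct.unpack("!HI", raw[:6])


def stun_username(packet: bytes) -> Optional[str]:
    """Extract USERNAME from a STUN Binding request, or None."""
    if len(packet) < 20:
        return None
    msg_type, msg_len, cookie = struct.unpack("!HHI", packet[:8])
    if msg_type != STUN_BINDING_REQUEST or cookie != STUN_MAGIC_COOKIE:
        return None
    offset, end = 20, min(len(packet), 20 + msg_len)
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", packet[offset:offset + 4])
        if attr_type == STUN_ATTR_USERNAME:
            value = packet[offset + 4:offset + 4 + attr_len]
            return value.decode("utf-8", errors="replace")
        offset += 4 + attr_len + (-attr_len % 4)  # values pad to 4 bytes
    return None


def route_first_packet(packet: bytes) -> Optional[Tuple[int, int]]:
    """Return (cluster_id, transceiver_id) from the first STUN packet."""
    username = stun_username(packet)
    if username is None:
        return None
    server_ufrag = username.split(":", 1)[0]
    try:
        return route_from_ufrag(server_ufrag)
    except (ValueError, struct.error):
        return None


def build_binding_request(username: str) -> bytes:
    """Build a minimal Binding request; used only to exercise the parser."""
    value = username.encode("utf-8")
    padded = value + b"\x00" * (-len(value) % 4)
    attr = struct.pack("!HH", STUN_ATTR_USERNAME, len(value)) + padded
    header = struct.pack("!HHI", STUN_BINDING_REQUEST, len(attr), STUN_MAGIC_COOKIE)
    return header + os.urandom(12) + attr
```

In this sketch the relay needs no lookup service: everything required to pick a destination arrives inside the packet itself. A production encoding would presumably also need integrity protection so that clients cannot steer traffic to arbitrary backends; the article does not cover that detail.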

How they reduced latency

Once the public UDP surface was reduced to a small number of stable addresses and ports, OpenAI scaled the same scheme globally. The result is Global Relay, a distributed set of ingress points that receive packets close to the user and bring them into OpenAI's network without an extra first hop through a distant region. For signaling, the company uses Cloudflare's geographic and proximity routing, so both the initial HTTP/WebSocket request and the first ICE check arrive at the nearest suitable cluster.

Small and fixed public UDP layer instead of thousands of open ports
First packet routing through data already embedded in the ICE ufrag
Shared UDP socket on the transceiver side instead of a socket per session
Short-lived in-memory state plus Redis cache for fast route recovery
`SO_REUSEPORT`, thread pinning to OS threads, and minimized allocations for high throughput
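The pairing of short-lived in-memory state with a Redis cache can be sketched like this. `RouteCache` and its methods are hypothetical names, and a plain dict stands in for Redis so the example runs on its own; the TTL value and the key layout are also assumptions, not details from OpenAI.

```python
import time
from typing import Callable, Dict, MutableMapping, Optional, Tuple

Addr = Tuple[str, int]


class RouteCache:
    """Hypothetical route table: local entries expire, a shared store persists."""

    def __init__(
        self,
        ttl_seconds: float,
        shared_store: MutableMapping[Addr, Addr],  # stand-in for Redis
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl = ttl_seconds
        self.shared = shared_store
        self.clock = clock
        self.local: Dict[Addr, Tuple[Addr, float]] = {}

    def remember(self, client: Addr, transceiver: Addr) -> None:
        """Record a route locally and in the shared store."""
        self.local[client] = (transceiver, self.clock() + self.ttl)
        self.shared[client] = transceiver

    def lookup(self, client: Addr) -> Optional[Addr]:
        """Prefer the fresh local entry; fall back to the shared store."""
        entry = self.local.get(client)
        if entry is not None and entry[1] > self.clock():
            return entry[0]
        self.local.pop(client, None)
        transceiver = self.shared.get(client)
        if transceiver is not None:
            # Repopulate local state, e.g. after a restart or an expiry.
            self.local[client] = (transceiver, self.clock() + self.ttl)
        return transceiver
```

A freshly started relay instance has an empty `local` table, so its first lookup for an existing flow falls through to the shared store, which is the recovery path the article mentions.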

OpenAI wrote its relay in Go and deliberately kept its responsibilities narrow: it does not terminate the WebRTC session; it only parses the headers it needs, updates minimal thread state, and forwards packets onward. The company specifically emphasizes that it did not need kernel bypass. Careful optimization with `SO_REUSEPORT`, thread pinning, and fewer unnecessary copies was enough to handle global real-time media traffic with a relatively small relay layer, without abandoning standard WebRTC behavior on clients.
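For readers unfamiliar with `SO_REUSEPORT`, here is a minimal sketch of the socket option itself, in Python rather than Go for brevity. The function name is invented for this example, and the code assumes a platform that supports the option (Linux and the BSDs do; Windows does not). It is not a reconstruction of OpenAI's relay.

```python
import socket
from typing import List


def open_reuseport_sockets(host: str, port: int, count: int) -> List[socket.socket]:
    """Open several UDP sockets bound to the same address and port."""
    sockets: List[socket.socket] = []
    for _ in range(count):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # The option must be set on every socket before bind().
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        # If port was 0, reuse the kernel-chosen port for the rest.
        port = sock.getsockname()[1]
        sockets.append(sock)
    return sockets
```

On Linux, the kernel spreads incoming datagrams across the sockets in such a group by hashing the packet's addresses and ports, so each worker can read from its own socket on one fixed public port without a shared lock.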

What it means

For users, all of this looks like "voice mode became more responsive," but for the market, something else matters more: OpenAI demonstrated how to build mass-scale voice AI on top of standard WebRTC without custom clients and without the painful sprawl of network infrastructure. This is a good reference point for developers of real-time assistants, voice agents, and products where half a second of latency already breaks the entire user experience.

