OpenAI adds GPT-Realtime-2, Translate, and Whisper to the API for voice applications
OpenAI announced three real-time voice models for the API at once. GPT-Realtime-2 offers GPT-5-level reasoning, can call tools, and supports a context window of up to 128K tokens.

On May 7, 2026, OpenAI introduced three real-time voice models to its API: GPT-Realtime-2 for dialogue and actions, GPT-Realtime-Translate for live translation, and GPT-Realtime-Whisper for streaming transcription. The company is clearly moving voice interfaces from a "reply to input" mode to one where the assistant can listen, reason, use tools, and maintain conversation continuity.
Three Models at Once
The main idea behind the release is straightforward: voice in applications should work not as a decorative overlay, but as a full-fledged interface. OpenAI notes that developers are increasingly building three types of scenarios: voice-to-action, where users formulate tasks by voice and the system executes them; systems-to-voice, where software itself informs users about what is happening; and voice-to-voice, where AI helps facilitate conversations between people speaking different languages. The new model lineup was assembled to address this range of scenarios.
- GPT-Realtime-2 — a voice model with reasoning at GPT-5 level, supporting tool calls and longer context windows.
- GPT-Realtime-Translate — real-time speech translation from more than 70 input languages to 13 output languages with minimal pauses.
- GPT-Realtime-Whisper — streaming transcription that writes text as speech occurs, rather than after a sentence is complete.
Pricing was announced up front: GPT-Realtime-2 costs $32 per million input audio tokens and $64 per million output audio tokens; GPT-Realtime-Translate runs $0.034 per minute, and GPT-Realtime-Whisper $0.017 per minute.
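For teams budgeting a pilot, these rates reduce to simple per-session arithmetic. The sketch below is a back-of-envelope estimate using the published prices; the token counts for the sample call are illustrative assumptions, not OpenAI figures.

```python
# Back-of-envelope cost estimates from the published rates.
# The per-call token counts below are illustrative assumptions.

REALTIME2_INPUT_PER_M = 32.00   # $ per 1M input audio tokens
REALTIME2_OUTPUT_PER_M = 64.00  # $ per 1M output audio tokens
TRANSLATE_PER_MIN = 0.034       # $ per minute of translated audio
WHISPER_PER_MIN = 0.017         # $ per minute of transcribed audio

def realtime2_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a GPT-Realtime-2 session given its audio token counts."""
    return (input_tokens / 1_000_000) * REALTIME2_INPUT_PER_M + \
           (output_tokens / 1_000_000) * REALTIME2_OUTPUT_PER_M

# A ten-minute support call that consumed, say, 80K input and 40K output tokens:
print(f"GPT-Realtime-2 call: ${realtime2_cost(80_000, 40_000):.2f}")  # $5.12
print(f"Translate, 10 min:   ${10 * TRANSLATE_PER_MIN:.3f}")          # $0.340
print(f"Whisper, 10 min:     ${10 * WHISPER_PER_MIN:.3f}")            # $0.170
```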
All three models are already available through the Realtime API and can be tested in the Playground. That availability is the notable part: OpenAI is not showcasing a distant concept but shipping a ready-made set of tools for teams building support services, voice agents, live translation, meeting notes, and other products built around live speech. For the market, it signals the tools are ready not only for demos but also for pilots.
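Getting a first session running requires little more than a WebSocket connection. The sketch below assumes the Realtime API keeps its current WebSocket endpoint and event vocabulary; the model name comes from the announcement, and the event handling is deliberately simplified.

```python
# Minimal connection sketch, assuming the Realtime API's existing WebSocket
# endpoint; event shapes are simplified for illustration.
import asyncio
import json
import os

import websockets  # pip install websockets

async def main() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # websockets >= 13 uses additional_headers; older releases call it extra_headers.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Ask the model to respond to whatever is in the input buffer.
        await ws.send(json.dumps({"type": "response.create"}))
        async for raw in ws:
            event = json.loads(raw)
            print(event.get("type"))  # inspect the server's event stream

asyncio.run(main())
```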
What Improved in Dialogue
The most noticeable update is in GPT-Realtime-2. The model can insert brief filler phrases like "let me check that" so users understand the system is working on a task. It can call multiple tools in parallel, narrate what it is doing, recover better from errors and interruptions, and handle notably longer scenarios: the context window has grown from 32K to 128K tokens. For production use this matters far more than a "pleasant voice," because real assistants typically break down on long chains of interactions.
OpenAI specifically emphasizes the model's controllability. Developers can choose a reasoning level from minimal to xhigh, trading latency against answer quality. Understanding of specialized terminology, proper names, and domain-specific vocabulary, such as medical terms, has also improved.
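In practice, these controls would be set when the session is configured. The sketch below illustrates them with a session.update event in the style of the existing Realtime API; the reasoning_effort field name, the intermediate level names, and the lookup_order tool are assumptions for illustration, not confirmed API details.

```python
# Hypothetical session configuration in the style of the Realtime API's
# session.update event. Field names flagged below are assumptions.
import json

session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",
        # "minimal" and "xhigh" come from the announcement; the field name
        # and the intermediate levels are assumptions.
        "reasoning_effort": "high",
        "instructions": (
            "You are a support agent. Before a long lookup, say a brief "
            "filler phrase such as 'let me check that' so the caller knows "
            "work is underway."
        ),
        "tools": [
            {
                "type": "function",
                "name": "lookup_order",  # hypothetical tool for illustration
                "description": "Fetch an order's status by its ID.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            }
        ],
    },
}

# Sent over the same WebSocket as in the connection sketch:
# await ws.send(json.dumps(session_update))
```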
In internal evaluations, GPT-Realtime-2 in high mode scored 15.2% better than GPT-Realtime-1.5 on Big Bench Audio, and in xhigh mode 13.8% better on Audio MultiChallenge, which measures instruction-following in conversation.
"After tuning prompts, we saw call success rates improve from 69% to 95%," — this is how
Zillow describes early GPT-Realtime-2 tests.
Translation and Transcription
The second model, GPT-Realtime-Translate, targets live multilingual dialogue. It translates speech as the conversation unfolds, preserving the speaker's pace and meaning even when people speak with accents, jump between topics, or use industry-specific terminology. OpenAI highlights use cases in support, cross-border sales, education, events, media, and creator platforms.
Deutsche Telekom is testing the model for multilingual customer support, while Vimeo demonstrates a scenario where educational video is translated during playback.
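A translation client would essentially run two loops: stream microphone audio up and play translated audio as it comes back. The sketch below borrows the input_audio_buffer.append and response.output_audio.delta event names from the existing Realtime API; whether GPT-Realtime-Translate reuses them is an assumption, and mic_chunks and speaker are hypothetical stand-ins for real audio I/O.

```python
# Hypothetical streaming-translation loop. Event names mirror the existing
# Realtime API; mic_chunks (an async iterator of PCM16 frames) and speaker
# (a playback object) are stand-ins for real audio I/O.
import asyncio
import base64
import json

async def translate_stream(ws, mic_chunks, speaker) -> None:
    """Pump microphone audio upstream and play translated audio downstream."""

    async def uplink() -> None:
        async for chunk in mic_chunks:
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode("ascii"),
            }))

    async def downlink() -> None:
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.output_audio.delta":
                speaker.play(base64.b64decode(event["delta"]))

    # Run both directions concurrently so translation keeps pace with speech.
    await asyncio.gather(uplink(), downlink())
```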
The third model, GPT-Realtime-Whisper, addresses a humbler but widely needed task: quickly converting speech to text. OpenAI positions it as a foundation for subtitles, meeting notes, lecture transcription, live broadcasts, and voice agents that need to continuously understand what users are saying.
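Because text streams in while the speaker is still talking, the client's job reduces to handling partial and final transcript events. The handler below follows the event naming of the current Realtime transcription interface; whether GPT-Realtime-Whisper keeps exactly these shapes is an assumption.

```python
# Sketch of consuming streaming transcription events. Event names follow the
# current Realtime transcription interface; their exact shapes for
# GPT-Realtime-Whisper are an assumption.
import json

def handle_event(raw: str) -> None:
    event = json.loads(raw)
    etype = event.get("type", "")
    if etype.endswith("input_audio_transcription.delta"):
        # Partial text arrives while the speaker is still talking.
        print(event["delta"], end="", flush=True)
    elif etype.endswith("input_audio_transcription.completed"):
        # Finalized text for the completed utterance.
        print("\n>>", event["transcript"])
```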
At the same time, the company points to its safety mechanisms: the Realtime API runs active classifiers, sessions may be stopped if rules are violated, and developers must clearly disclose to users when they are speaking with an AI.
What This Means
OpenAI is attempting to occupy not only the chat-model market but also the foundational layer for voice products. If quality and latency truly match the stated metrics, the company gains a strong position in call centers, travel services, educational platforms, and corporate assistants, where stable conversation, pause-free translation, and text appearing at the same moment the user speaks matter more than impressive demos.