
OpenAI launched GPT-Realtime-2 and two more voice models via the API

OpenAI expanded the API with three voice models: the updated GPT-Realtime-2 and two new ones. They let apps recognize speech, synthesize it, and translate conversations in real time.

Source: 3DNews AI. Collage: Hamidun News.

OpenAI announced an expansion of voice capabilities in its API — developers now have access to an updated GPT-Realtime-2 model and two new voice models for speech recognition, synthesis, and translation.

Three New Voice Models in the API

Three models have been added to the API: an updated GPT-Realtime-2 and two completely new ones. They cover different tasks: recognizing user speech, synthesizing spoken responses, and translating conversations between languages in real time. This means developers can now embed voice interaction directly into their applications without relying on external speech recognition and synthesis services. Previously it took several providers, one for recognition, another for synthesis, a third for translation; now everything is in one place.
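The "one provider instead of three" idea can be sketched as a single pipeline with three stages. This is a minimal illustration, not OpenAI's actual SDK: the stage functions are injected as plain callables, and in a real app each would wrap the corresponding API call (recognition, the language model, synthesis).

```python
from dataclasses import dataclass
from typing import Callable

# A minimal sketch of the unified voice pipeline the article describes:
# recognition, reasoning, and synthesis behind one interface. The stage
# functions are injected, so the flow can be exercised offline with stubs.

@dataclass
class VoicePipeline:
    transcribe: Callable[[bytes], str]   # speech-to-text stage
    respond: Callable[[str], str]        # language-model stage
    synthesize: Callable[[str], bytes]   # text-to-speech stage

    def turn(self, audio_in: bytes) -> bytes:
        """One conversational turn: hear, understand, speak."""
        text = self.transcribe(audio_in)
        reply = self.respond(text)
        return self.synthesize(reply)
```

With the real API, each callable would be a thin wrapper around one endpoint; the application code above would not change.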

What the New Models Can Do

  • Speech recognition (speech-to-text) with support for many languages
  • Speech synthesis (text-to-speech) with natural sound and intonation
  • Real-time conversation translation while preserving context
  • Low latency for interactive applications (streaming)
  • Deep integration with GPT-4 for semantic understanding
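The low-latency streaming item above boils down to one pattern: process audio in small chunks and emit partial results as soon as they are ready, instead of waiting for the full utterance. A hedged sketch, with `recognize_chunk` standing in for a hypothetical streaming recognition call:

```python
from typing import Callable, Iterable, Iterator

# Sketch of the streaming pattern behind "low latency": each audio chunk
# is recognized as it arrives, and the growing partial transcript is
# yielded immediately so the UI can react before the user stops talking.
# `recognize_chunk` is a placeholder for a real streaming endpoint.

def stream_transcribe(
    chunks: Iterable[bytes],
    recognize_chunk: Callable[[bytes], str],
) -> Iterator[str]:
    partial = ""
    for chunk in chunks:
        partial += recognize_chunk(chunk)
        yield partial  # partial transcript after each chunk
```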

The models are trained on large volumes of audio data and perform well in English as well as other languages. The GPT-Realtime-2 update improves natural speech processing, context understanding, and response speed. Developers get tools to build applications that hear the user, understand what they are saying, and answer with a voice, which matters for voice assistants, call centers, educational applications, and interactive services.

How It Works in Practice

Imagine a language-learning application. A student speaks in a foreign language; the API transcribes the speech (speech-to-text), sends the text to GPT-4 for checking and correction, then voices the result as natural speech (text-to-speech), all in real time. Or consider a translator application: a tourist speaks in Russian, and the API translates and voices the result in English on the fly, without the turn-taking delays of tools like Google Translate.
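The translator scenario adds one detail the article highlights: context is preserved across turns, so pronouns and topic carry over between utterances. A sketch under the same assumptions as before (all stage functions are hypothetical stand-ins for real API calls):

```python
from typing import Callable, List, Tuple

# Sketch of the tourist-translator turn: transcribe the utterance,
# translate it with the running conversation as context, record the
# pair in the history, then voice the translation. Stage functions are
# injected placeholders for the speech and translation endpoints.

def translate_turn(
    audio: bytes,
    history: List[Tuple[str, str]],  # (source, translation) pairs so far
    transcribe: Callable[[bytes], str],
    translate: Callable[[str, List[Tuple[str, str]]], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    source = transcribe(audio)
    target = translate(source, history)  # context-aware translation
    history.append((source, target))     # preserve context for next turn
    return synthesize(target)
```

Keeping `history` outside the function means the same turn logic serves a whole conversation; each call sees everything said before it.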

Availability and Competition

For now, the models are available only through the API for developers; they will not appear in ChatGPT or other OpenAI consumer applications, at least not in the near future. This lets OpenAI release new capabilities to specialists first, refine them on real applications, and then, if warranted, integrate them into consumer products. API prices will be higher than for text models but lower than those of competitors (for example, Google Cloud Speech-to-Text). OpenAI competes here with Google, Amazon Polly, Microsoft Azure Speech Services, and other cloud platforms. Voice APIs are a competitive field where every millisecond of latency and every percentage point of accuracy matter.

Voice interfaces are no longer exotic; they are becoming the standard for modern applications.

What This Means

Voice interfaces are becoming more accessible. Any developer can now add voice interaction with AI to an application without costly integration of third-party services. This will accelerate the arrival of voice AI applications on the market and make interaction with services more natural.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.