OpenAI Revolutionizes Voice AI with GPT-5 Models

Key Takeaways

1. OpenAI introduces GPT-Realtime-2, a voice model offering real-time reasoning comparable to that of GPT-5.

2. The new models include GPT-Realtime-Translate for instant translation and GPT-Realtime-Whisper for continuous transcription.

3. These models enable more natural and accurate voice interactions, with applications ranging from customer support to education.

💡 Why it matters: These models narrow the gap between voice AI and text AI, making spoken human-machine interactions smoother and more efficient in sectors such as customer support and education.

OpenAI Unveils a New Generation of Real-Time Voice Models

OpenAI has launched three models designed for real-time voice interaction: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. They reason, translate, and transcribe, respectively, while the conversation is still in progress.

The central model, GPT-Realtime-2, stands out for its ability to use multiple tools simultaneously and to adjust the intensity of its reasoning across five levels. This lets developers control how deeply the model processes a request and trade that depth against response latency.

In addition, GPT-Realtime-Translate ensures live translation, while GPT-Realtime-Whisper handles continuous transcription. All these models are now accessible via OpenAI's Realtime API, paving the way for new applications across various fields.

A Major Advancement in Voice Interactions

While ChatGPT already offers an audio mode, and Google provides similar features with Gemini, voice models have so far performed worse than text models. Text models have more time to reason before they respond, whereas a voice model has to answer almost immediately.

OpenAI believes this has to change. A modern voice agent must not only understand the user's intentions but also follow the context, adapt to changes, use appropriate tools, and respond appropriately, all in real time.

To meet these requirements, OpenAI describes three interaction patterns, which are distinct from the three models themselves. In "Voice-to-Action", users state what they need aloud, and the system reasons about the request and carries out the task. "Systems-to-Voice" turns contextual information into spoken advice. "Voice-to-Voice" enables live conversations across language barriers and has already been tested by Deutsche Telekom for customer support.

GPT-Realtime-2: A Flagship Model with Advanced Reasoning Capabilities

GPT-Realtime-2 is presented as the flagship of the new range, with reasoning comparable to that of GPT-5. Designed for dynamic voice interactions, it can hold a conversation, reason about requests, call tools, and handle interruptions at the same time.

Technically, the context window of this model has been extended from 32,000 to 128,000 tokens, allowing for longer and more complex conversations. The model can also use multiple tools in parallel, making these actions audible through introductory phrases like "let me check that". In case of an issue, the model informs the user with messages such as "I'm having trouble with that right now."
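The parallel tool use described above can be illustrated with a minimal Python sketch. It is not OpenAI's implementation and does not call the Realtime API: the two tools, their return values, and the `handle_turn` helper are hypothetical stand-ins. It only shows the general pattern of announcing the work, running several lookups concurrently, and falling back to a spoken error message if one of them fails.

```python
import asyncio

# Hypothetical tools: stand-ins for real lookups a voice agent might perform.
async def check_order_status(order_id: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"order {order_id}: shipped"

async def check_refund_policy(region: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"refunds in {region}: 30 days"

async def handle_turn(calls) -> list[str]:
    """Return the lines the agent would speak for one turn.

    The introductory phrase comes first, then either the tool results
    (gathered concurrently) or a fallback message if any tool raised.
    """
    spoken = ["Let me check that."]
    try:
        # gather() schedules all coroutines at once and preserves call order.
        results = await asyncio.gather(*calls)
        spoken.extend(results)
    except Exception:
        spoken.append("I'm having trouble with that right now.")
    return spoken
```

Calling `asyncio.run(handle_turn([check_order_status("A1"), check_refund_policy("EU")]))` returns the introductory phrase followed by both results, in the order the calls were passed in.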

OpenAI emphasizes that this model is more effective at handling specialized terminology, proper names, and medical terms than its predecessor. The tone of voice is also adjustable, allowing for a calm tone when resolving issues, an empathetic tone with frustrated users, and an enthusiastic tone after successful actions.

Developers have the option to adjust the intensity of reasoning across five levels: minimal, low, medium, high, and very high. The default setting is "low" to minimize latency during simple requests, while more complex tasks can benefit from increased computational power.
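As a rough illustration of how an application might choose among these levels, here is a small Python sketch. The five level names and the "low" default come from the article; the `pick_level` heuristic and its two inputs are assumptions made for this example. How the level is actually passed to the Realtime API is not covered here, so the sketch stops at selecting a value.

```python
from enum import Enum

class ReasoningLevel(str, Enum):
    """The five reasoning levels named in the article."""
    MINIMAL = "minimal"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    VERY_HIGH = "very high"

# The article states that "low" is the default, to keep latency down.
DEFAULT_LEVEL = ReasoningLevel.LOW

def pick_level(needs_tools: bool, multi_step: bool) -> ReasoningLevel:
    """Hypothetical heuristic: spend more reasoning only on harder requests."""
    if needs_tools and multi_step:
        return ReasoningLevel.HIGH
    if needs_tools or multi_step:
        return ReasoningLevel.MEDIUM
    return DEFAULT_LEVEL
```

A simple greeting would stay at the low default, while a multi-step request that needs tool calls would be raised to "high" at the cost of some latency.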

On benchmarks, GPT-Realtime-2 outperforms its predecessor, GPT-Realtime-1.5. At a "high" setting, it achieves 96.6% accuracy on Big Bench Audio, compared to 81.4% previously. On Audio MultiChallenge, which evaluates instruction-following in multi-turn dialogues, the "very high" variant achieves an average success rate of 48.5% compared to 34.7%.
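To put these figures in perspective, the short Python sketch below computes the absolute gain (in percentage points) and the relative gain (in percent) for both benchmarks. The scores are the ones quoted above; the `gains` helper is introduced only for this illustration.

```python
def gains(new: float, old: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new - old
    relative = absolute / old * 100
    return round(absolute, 1), round(relative, 1)

# Big Bench Audio: 96.6% vs. 81.4%
big_bench_audio = gains(96.6, 81.4)        # (15.2, 18.7)

# Audio MultiChallenge: 48.5% vs. 34.7%
audio_multichallenge = gains(48.5, 34.7)   # (13.8, 39.8)
```

The absolute gains are similar on both benchmarks (15.2 and 13.8 points), but the relative gain is much larger on Audio MultiChallenge because the starting score was lower.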

Real-Time Translation and Transcription: Powerful Tools for the Modern World

GPT-Realtime-Translate is a live translation model that supports over 70 input languages and 13 output languages. According to OpenAI, it retains meaning while keeping pace with the speaker, even in the presence of contextual changes, regional accents, and specialized vocabulary. Potential applications include customer support, cross-border sales, education, events, and media.

The GPT-Realtime-Whisper model, on the other hand, is designed for low-latency streaming transcription. It transcribes speech in real time, targeting live subtitles for meetings, classrooms, broadcasts, and events. Teams can use it to generate notes and summaries while conversations continue, build voice agents with continuous speech understanding, and set up faster follow-up workflows for customer support, healthcare, sales, and recruitment.

Flexible Pricing for Accessible Solutions

The three models are now available via the Realtime API and can be tested in the Playground. GPT-Realtime-2 is priced at $32 per million input audio tokens ($0.40 for cached input tokens) and $64 per million output audio tokens. GPT-Realtime-Translate is charged at $0.034 per minute, and GPT-Realtime-Whisper at $0.017 per minute.
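The prices above translate into session costs as in the following Python sketch. The rates are those quoted in the article. The function names are made up for this example, and the sketch assumes that the $0.40 cached-input rate is also per million tokens and that cached tokens are counted separately from regular input tokens.

```python
# Rates quoted in the article, in US dollars.
PRICE_PER_MILLION_TOKENS = {
    "input": 32.00,
    "cached_input": 0.40,  # assumed to be per million tokens as well
    "output": 64.00,
}
PRICE_PER_MINUTE = {"translate": 0.034, "whisper": 0.017}

def realtime2_cost(input_tokens: int, cached_input_tokens: int, output_tokens: int) -> float:
    """Estimate the GPT-Realtime-2 audio cost in dollars for one session."""
    total = (
        input_tokens * PRICE_PER_MILLION_TOKENS["input"]
        + cached_input_tokens * PRICE_PER_MILLION_TOKENS["cached_input"]
        + output_tokens * PRICE_PER_MILLION_TOKENS["output"]
    )
    return total / 1_000_000

def per_minute_cost(model: str, minutes: float) -> float:
    """Estimate the cost in dollars for the per-minute models."""
    return PRICE_PER_MINUTE[model] * minutes
```

For example, a session with 200,000 input tokens, 50,000 cached input tokens, and 100,000 output tokens comes to about $12.82 (6.40 + 0.02 + 6.40), while an hour of live translation costs about $2.04 and an hour of transcription about $1.02.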

The Realtime API supports data residency in the European Union for EU-based applications and is covered by OpenAI's enterprise privacy commitments.