OpenAI Pushes Beyond Chatbots With Real-Time Conversational AI Models

GPT-Realtime-2, Realtime-Translate, and Realtime-Whisper are built to handle complex audio requests, with the flagship GPT-Realtime-2 adding advanced reasoning capabilities.

MIT SMR Editors May 08, 2026


As voice interfaces become central to how users interact with artificial intelligence, OpenAI has introduced a new suite of audio models designed to help developers build real-time conversational applications capable of reasoning, translating, transcribing, and executing tasks during live interactions.

The company on Thursday unveiled three new models through its API platform — GPT-Realtime-2, Realtime-Translate, and Realtime-Whisper — marking a broader push toward AI systems that can respond to spoken requests with lower latency and greater contextual awareness.

In a blog post, the maker of ChatGPT said voice is increasingly becoming a natural interface for users seeking assistance while multitasking, whether navigating airports, driving, or accessing support in their preferred language. The company said the latest models are intended to move voice AI beyond “simple call-and-response” interactions toward systems that can actively manage workflows and conversations in real time.

GPT-Realtime-2, the flagship model in the release, incorporates GPT-5-class reasoning capabilities aimed at handling more complex conversational tasks. OpenAI said the model can maintain context over longer sessions, adapt responses dynamically, and integrate with external tools while conversations are underway.

The company is also expanding the model’s context window from 32K to 128K tokens, enabling developers to build applications that support longer, more coherent interactions. OpenAI said the system is designed to retain specialized terminology, including healthcare vocabulary, proper nouns, and industry-specific language commonly used in enterprise environments.

The Realtime-Translate model supports speech translation from more than 70 input languages and can produce spoken output in 13 languages. Meanwhile, Realtime-Whisper provides live speech-to-text transcription as users speak.

OpenAI said developers are increasingly building around three emerging categories of voice AI applications: “voice-to-action,” where spoken commands trigger tasks through integrated tools; “system-to-voice,” in which software delivers contextual spoken guidance; and “voice-to-voice,” where AI facilitates multilingual or multi-context conversations in real time.

The company cited several early enterprise use cases. Deutsche Telekom is developing multilingual customer support systems powered by live translation capabilities, while travel platform Priceline is exploring voice-driven trip management experiences that would allow travelers to organize itineraries conversationally.

OpenAI said the models can adjust tone and conversational style based on context, enabling calmer responses during issue resolution, empathetic interactions with frustrated users, or more upbeat confirmations when tasks are completed successfully.

Developers will also be able to choose among multiple reasoning settings — minimal, low, medium, high, and xhigh — depending on whether they prioritize lower latency or deeper reasoning during interactions.
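The trade-off between latency and depth can be illustrated with a short Python sketch. It models the five settings named by OpenAI as an ordered type and applies a hypothetical policy: pick the deepest setting whose latency, as measured by the developer in their own application, fits a response-time budget. The class and function names are illustrative assumptions, and the sketch deliberately does not show how a setting is passed to the Realtime API, since the announcement does not describe that.

```python
from __future__ import annotations

from enum import IntEnum


class ReasoningLevel(IntEnum):
    """The five reasoning settings named for GPT-Realtime-2, ordered from
    shallowest (fastest) to deepest (slowest). Hypothetical modeling only."""
    MINIMAL = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    XHIGH = 4


def deepest_within_budget(measured_latency_ms: dict[ReasoningLevel, float],
                          budget_ms: float) -> ReasoningLevel:
    """Hypothetical policy: return the deepest level whose developer-measured
    latency fits the budget, falling back to MINIMAL when none fits."""
    fitting = [level for level, ms in measured_latency_ms.items() if ms <= budget_ms]
    return max(fitting, default=ReasoningLevel.MINIMAL)
```

A tight budget, as in a fast-paced support call, pushes the choice toward `MINIMAL` or `LOW`, while a task that can tolerate a pause, such as rebooking a multi-leg trip, can afford `HIGH` or `XHIGH`. The latencies must come from the developer's own measurements; none are published in the announcement.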

According to benchmarks released by the company, GPT-Realtime-2 (high) scored 15.2% higher than GPT-Realtime-1.5 on Big Bench Audio, a benchmark evaluating audio intelligence in AI systems. GPT-Realtime-2 (xhigh) also improved instruction-following performance by 13.8% on Audio MultiChallenge, a benchmark measuring conversational consistency, context handling, and multi-turn spoken interactions.

OpenAI said the Realtime API includes multiple safeguards to reduce the risk of misuse and allows developers to implement additional controls through its Agents SDK. The company added that developers must clearly disclose when users are interacting with AI systems unless that is already evident from the context.

The platform also includes EU Data Residency support for applications operating within the European Union and falls under OpenAI’s enterprise privacy commitments.

Pricing for GPT-Realtime-2 starts at $32 per one million audio input tokens and $64 per one million audio output tokens. Realtime-Translate costs $0.034 per minute, while Realtime-Whisper costs $0.017 per minute.
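To make these rates concrete, the following Python sketch converts them into per-session estimates. The usage volumes in the example (100,000 audio input tokens, 50,000 audio output tokens, a 10-minute translated call, and an hour of transcription) are hypothetical, and the estimate covers only the audio rates quoted above; text tokens and any other charges or discounts are not modeled.

```python
# Per-unit rates as quoted in the announcement (USD).
GPT_REALTIME_2_AUDIO_IN_PER_M = 32.00   # per 1M audio input tokens
GPT_REALTIME_2_AUDIO_OUT_PER_M = 64.00  # per 1M audio output tokens
TRANSLATE_PER_MIN = 0.034               # Realtime-Translate, per minute
WHISPER_PER_MIN = 0.017                 # Realtime-Whisper, per minute


def realtime2_audio_cost(input_tokens: int, output_tokens: int) -> float:
    """Audio-token cost of a GPT-Realtime-2 session at the starting rates."""
    return (input_tokens * GPT_REALTIME_2_AUDIO_IN_PER_M
            + output_tokens * GPT_REALTIME_2_AUDIO_OUT_PER_M) / 1_000_000


def per_minute_cost(minutes: float, rate_per_min: float) -> float:
    """Cost of a per-minute model (Translate or Whisper) for a given duration."""
    return minutes * rate_per_min


if __name__ == "__main__":
    # Hypothetical usage volumes, chosen only to illustrate the arithmetic.
    print(f"{realtime2_audio_cost(100_000, 50_000):.2f}")    # 6.40
    print(f"{per_minute_cost(10, TRANSLATE_PER_MIN):.2f}")   # 0.34
    print(f"{per_minute_cost(60, WHISPER_PER_MIN):.2f}")     # 1.02
```

Because output audio is priced at twice the input rate, a session in which the model speaks half as much as the user still splits its audio cost evenly between input and output ($3.20 each in the example).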
