OpenAI Revolutionizes Voice AI with Ultra-Low Latency

Key Takeaways

1. OpenAI has re-architected its WebRTC stack to serve more than 900 million weekly active users with minimal latency.

2. WebRTC gives OpenAI a standard way to carry real-time audio connections, which simplifies integration with its AI models.

3. OpenAI's new relay and transceiver architecture routes each packet to the right session owner while maintaining low latency.

💡 Why it matters: This work reduces delay in AI voice interactions, which is crucial for large-scale real-time applications.

The Importance of Latency in Voice AI

For voice AI to feel natural, it must operate at the speed of speech. Users immediately notice pauses or interruptions caused by latency. This matters for ChatGPT voice, for developers building on the Realtime API, and for agents in interactive workflows. OpenAI therefore works to minimize these delays.

To achieve this goal, OpenAI has identified three essential requirements: global reach for over 900 million weekly active users, fast connection setup at the start of a session, and low and stable round-trip latency. These requirements ensure that voice interactions proceed seamlessly.

Re-architecting the WebRTC Stack

OpenAI recently re-architected its WebRTC stack to meet the constraints of operating at large scale. Three constraints drove the redesign. Terminating media on a dedicated port per session does not fit well with OpenAI's infrastructure. ICE (Interactive Connectivity Establishment) and DTLS (Datagram Transport Layer Security) sessions are stateful and need a stable owner for their whole lifetime. And global routing must keep first-hop latency low.

WebRTC is an open standard for sending audio, video, and data with low latency. While it is often associated with peer-to-peer calls, it also serves as a practical foundation for client-server real-time systems. WebRTC standardizes the complex aspects of interactive media, such as ICE connectivity establishment, encrypted transport via DTLS and SRTP, codec negotiation, and RTCP quality control.

Audio must arrive as a continuous stream, allowing a voice agent to begin transcribing, reasoning, calling tools, or generating speech while the user is still speaking. This capability is essential for the system to appear conversational rather than functioning like a push-to-talk system.

The Role of Justin Uberti and Sean DuBois

The WebRTC ecosystem benefits from significant contributions from figures like Justin Uberti, one of the original architects of WebRTC, and Sean DuBois, creator and maintainer of Pion. Their work has enabled teams like OpenAI's to rely on a proven media infrastructure. Today, Justin and Sean are colleagues at OpenAI, working to bridge WebRTC and real-time AI.

Choosing a Media Architecture

After selecting WebRTC, OpenAI had to decide where to terminate the connection and how to link these sessions to the inference backend. Termination determines session state management, media transport, routing, latency, and failure isolation.

A Selective Forwarding Unit (SFU) is often used for multiparty products, but for 1:1 sessions OpenAI opted for a transceiver model. The transceiver terminates the WebRTC session and converts media and events into internal protocols for model inference, transcription, and speech generation.

Implementation with Pion

OpenAI built its transceiver service in Go, using Pion to handle signaling and media termination. This service powers ChatGPT voice, the Realtime API, and several research projects. The transceiver manages SDP negotiation, codec selection, ICE identifiers, and session configuration.

Deployment Challenges with Kubernetes

Deploying on Kubernetes presents challenges, including port exhaustion. The conventional WebRTC model of terminating each session on its own port requires large ranges of public UDP ports, which are difficult to manage and secure.

To address this, OpenAI separated packet routing from protocol termination, using a relay and transceiver architecture. This approach keeps the public UDP surface small while routing each packet to the transceiver that owns the corresponding WebRTC session.

Port Exhaustion

The first issue was the single-port-per-session model itself. At high concurrency, this means exposing and managing very large ranges of public UDP ports.

Cloud load balancers and Kubernetes services are not designed for tens of thousands of public UDP ports per service. Each additional range adds operational complexity in load balancer configuration, health checking, firewall policy, and deployment security.

Large ranges of UDP ports are difficult to secure as they increase the accessible attack surface and complicate network policy auditing.

They are also unsuitable for auto-scaling. Pods are constantly added, removed, and rescheduled in Kubernetes. Requiring each pod to reserve and announce a large range of stable ports makes this elasticity fragile.
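
As a rough illustration of the scaling problem described above, the following Go sketch computes how many public UDP ports and public addresses a port-per-session design would need. The session counts and the reserved port boundary (`firstPort`) are hypothetical values chosen for the example, not figures from OpenAI.

```go
package main

import "fmt"

// maxUDPPort is the highest UDP port number; ports are 16-bit values.
const maxUDPPort = 65535

// portsPerAddress returns how many ports of a single IP address remain
// usable for sessions when ports below firstPort are reserved.
func portsPerAddress(firstPort int) int {
	return maxUDPPort - firstPort + 1
}

// addressesNeeded returns the minimum number of public IP addresses a
// port-per-session design needs for the given number of concurrent sessions.
func addressesNeeded(sessions, firstPort int) int {
	per := portsPerAddress(firstPort)
	return (sessions + per - 1) / per // ceiling division
}

func main() {
	const firstPort = 1024 // hypothetical: skip the well-known port range
	for _, sessions := range []int{10_000, 100_000, 1_000_000} {
		fmt.Printf("%d sessions -> %d public ports, at least %d public addresses\n",
			sessions, sessions, addressesNeeded(sessions, firstPort))
	}
}
```

Every one of those ports would also have to appear in load balancer, firewall, and health-check configuration, which is where the operational cost described in this section comes from.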

Stateful Session Ownership

Single-port-per-server designs solve the port count issue but introduce a second problem: preserving session ownership across a fleet. ICE and DTLS are stateful protocols. The process that created a session must keep receiving that session's packets in order to validate connectivity checks, complete the DTLS handshake, decrypt SRTP, and handle later session changes such as ICE restarts. If packets for the same session arrive at a different process, session setup may fail or media may be interrupted.
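
To make the different packet types concrete, here is a minimal Go sketch of how a process can tell STUN (used by ICE connectivity checks), DTLS, RTP, and RTCP apart when they share one UDP flow. It follows the first-byte ranges defined in RFC 7983 and the RTCP packet-type range from RFC 5761. The names `PacketKind` and `Classify` are illustrative, not part of Pion or of OpenAI's code.

```go
package main

import "fmt"

// PacketKind names the protocol a UDP payload belongs to.
type PacketKind string

const (
	KindSTUN    PacketKind = "stun"
	KindDTLS    PacketKind = "dtls"
	KindRTP     PacketKind = "rtp"
	KindRTCP    PacketKind = "rtcp"
	KindUnknown PacketKind = "unknown"
)

// Classify inspects the first bytes of a UDP payload.
// First-byte ranges follow RFC 7983: 0..3 STUN, 20..63 DTLS, 128..191 RTP/RTCP.
// Within the RTP/RTCP range, a second byte of 192..223 indicates RTCP (RFC 5761).
func Classify(pkt []byte) PacketKind {
	if len(pkt) < 2 {
		return KindUnknown
	}
	switch b := pkt[0]; {
	case b <= 3:
		return KindSTUN
	case b >= 20 && b <= 63:
		return KindDTLS
	case b >= 128 && b <= 191:
		if pkt[1] >= 192 && pkt[1] <= 223 {
			return KindRTCP
		}
		return KindRTP
	}
	return KindUnknown
}

func main() {
	fmt.Println(Classify([]byte{0x00, 0x01})) // STUN binding request: prints "stun"
	fmt.Println(Classify([]byte{0x16, 0xfe})) // DTLS handshake record: prints "dtls"
	fmt.Println(Classify([]byte{0x80, 0x6f})) // RTP, payload type 111: prints "rtp"
	fmt.Println(Classify([]byte{0x80, 0xc8})) // RTCP sender report: prints "rtcp"
}
```

Classification alone is not enough: only the process holding the session's ICE credentials, DTLS state, and SRTP keys can actually handle each of these packets, which is why all of them must reach the same owner.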

This gave OpenAI a specific goal: expose a small fixed public UDP surface while routing each packet to the transceiver that owns the corresponding WebRTC session.

Comparing WebRTC Media Architectures

OpenAI evaluated several ways to achieve this, including TURN (Traversal Using Relays around NAT), where an edge relay terminates client allocations and forwards traffic on their behalf.

Advantages and disadvantages of each approach:
- Single IP:port per session (also known as native direct UDP):
  - Advantages: Direct client-server media path; no forwarding layer in the data path.
  - Disadvantages: Requires a public UDP port per session; large port ranges are difficult to expose and secure; poor fit for Kubernetes and cloud load balancers.
- Single IP:port per server:
  - Advantages: Much smaller public UDP footprint than per-session ports; a shared socket per server can demultiplex multiple sessions.
  - Disadvantages: Works cleanly on a single host, but not across a load-balanced fleet.
- TURN relay (terminating the protocol):
  - Advantages: Clients only need to reach the address and port of the TURN relay; policy can be centralized at the edge.
  - Disadvantages: TURN allocations add setup overhead; moving or recovering allocations between TURN servers remains challenging.
- Stateless forwarder + stateful terminator (OpenAI relay + transceiver):
  - Advantages: Small public UDP footprint; the transceiver always owns the complete WebRTC session.
  - Disadvantages: Adds a forwarding hop before media reaches the owning transceiver; requires custom coordination between the relay and the transceiver.

Overview of the Architecture: Relay + Transceiver

The architecture OpenAI deployed separates packet routing from protocol termination. Signaling always reaches the transceiver for session configuration, while media first enters through the relay. The relay is a lightweight UDP forwarding layer with a small public footprint.
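
The article does not describe how the relay identifies the owning transceiver, so the following Go sketch shows only one possible approach, under loudly stated assumptions. It assumes the transceiver embeds a fixed-length owner prefix in the ICE username fragment it hands out during signaling, and that the relay reads that prefix from the STUN USERNAME attribute of the client's connectivity checks. The prefix convention, the `Relay` type, the `bindingRequest` helper, and all addresses are hypothetical. The per-client address cache is a simplification so that DTLS and SRTP packets, which carry no username, follow the same route.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"strings"
)

const (
	stunHeaderLen   = 20
	stunMagicCookie = 0x2112A442
	attrUsername    = 0x0006 // STUN USERNAME attribute type
	ownerPrefixLen  = 4      // hypothetical: first 4 ufrag characters name the owner
)

// stunUsername returns the USERNAME attribute of a STUN message.
func stunUsername(pkt []byte) (string, error) {
	if len(pkt) < stunHeaderLen || binary.BigEndian.Uint32(pkt[4:8]) != stunMagicCookie {
		return "", errors.New("not a STUN message")
	}
	end := stunHeaderLen + int(binary.BigEndian.Uint16(pkt[2:4]))
	if end > len(pkt) {
		return "", errors.New("truncated STUN message")
	}
	attrs := pkt[stunHeaderLen:end]
	for len(attrs) >= 4 {
		typ := binary.BigEndian.Uint16(attrs[0:2])
		n := int(binary.BigEndian.Uint16(attrs[2:4]))
		if 4+n > len(attrs) {
			break
		}
		if typ == attrUsername {
			return string(attrs[4 : 4+n]), nil
		}
		next := 4 + ((n + 3) &^ 3) // attribute values are padded to 4 bytes
		if next > len(attrs) {
			break
		}
		attrs = attrs[next:]
	}
	return "", errors.New("no USERNAME attribute")
}

// Relay maps packets to transceivers. It is not safe for concurrent use.
type Relay struct {
	transceivers map[string]string // owner prefix -> transceiver address
	flows        map[string]string // client address -> transceiver address
}

func NewRelay(transceivers map[string]string) *Relay {
	return &Relay{transceivers: transceivers, flows: map[string]string{}}
}

// Route picks the destination transceiver for one packet from clientAddr.
func (r *Relay) Route(clientAddr string, pkt []byte) (string, bool) {
	if len(pkt) > 0 && pkt[0] <= 3 { // first byte 0..3 marks STUN (RFC 7983)
		if user, err := stunUsername(pkt); err == nil {
			// Client-to-server checks carry "serverUfrag:clientUfrag".
			serverUfrag, _, _ := strings.Cut(user, ":")
			if len(serverUfrag) >= ownerPrefixLen {
				if dst, ok := r.transceivers[serverUfrag[:ownerPrefixLen]]; ok {
					r.flows[clientAddr] = dst // remember for DTLS and SRTP packets
					return dst, true
				}
			}
		}
	}
	dst, ok := r.flows[clientAddr]
	return dst, ok
}

// bindingRequest builds a minimal STUN binding request for the demo.
func bindingRequest(username string) []byte {
	padded := (len(username) + 3) &^ 3
	pkt := make([]byte, stunHeaderLen+4+padded)
	binary.BigEndian.PutUint16(pkt[0:2], 0x0001) // Binding request
	binary.BigEndian.PutUint16(pkt[2:4], uint16(4+padded))
	binary.BigEndian.PutUint32(pkt[4:8], stunMagicCookie)
	binary.BigEndian.PutUint16(pkt[20:22], attrUsername)
	binary.BigEndian.PutUint16(pkt[22:24], uint16(len(username)))
	copy(pkt[24:], username)
	return pkt
}

func main() {
	r := NewRelay(map[string]string{"tx07": "10.0.3.7:4000"})
	client := "203.0.113.5:51000"
	fmt.Println(r.Route(client, bindingRequest("tx07Qm9x:clientUfrag"))) // 10.0.3.7:4000 true
	fmt.Println(r.Route(client, []byte{0x16, 0xfe, 0xfd}))               // 10.0.3.7:4000 true
	_, ok := r.Route("198.51.100.9:40000", []byte{0x80, 0x6f})
	fmt.Println(ok) // false
}
```

In this sketch the relay never touches ICE, DTLS, or SRTP state. It only reads enough of each packet to choose a destination, and the address cache can be rebuilt from the next connectivity check, which keeps protocol termination entirely inside the transceiver.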
