OpenAI Revolutionizes Voice AI with Ultra-Low Latency
The Importance of Latency in Voice AI
For voice AI to be perceived as natural, it must operate at the speed of speech. Users immediately notice pauses or interruptions caused by latency. This is particularly crucial for voice in ChatGPT, as well as for developers using the Realtime API and agents in interactive workflows. OpenAI strives to minimize these delays to provide a smooth user experience.
To achieve this goal, OpenAI has identified three essential requirements: global reach for over 900 million weekly active users, fast connection setup at the start of a session, and low and stable round-trip latency. These requirements ensure that voice interactions proceed seamlessly.
Re-architecting the WebRTC Stack
OpenAI recently re-architected its WebRTC stack to meet large-scale constraints. Terminating media on one port per session does not fit well with OpenAI's infrastructure. Additionally, ICE (Interactive Connectivity Establishment) and DTLS (Datagram Transport Layer Security) sessions are stateful and need a stable owning process, and global routing must maintain low first-hop latency.
WebRTC is an open standard for sending audio, video, and data with low latency. While it is often associated with peer-to-peer calls, it also serves as a practical foundation for client-server real-time systems. WebRTC standardizes the complex aspects of interactive media, such as ICE connectivity establishment, encrypted transport via DTLS and SRTP, codec negotiation, and RTCP quality control.
Audio must arrive as a continuous stream, allowing a voice agent to begin transcribing, reasoning, calling tools, or generating speech while the user is still speaking. This capability is essential for the system to appear conversational rather than functioning like a push-to-talk system.
The Role of Justin Uberti and Sean DuBois
The WebRTC ecosystem benefits from significant contributions from figures like Justin Uberti, one of the original architects of WebRTC, and Sean DuBois, creator and maintainer of Pion. Their work has enabled teams like OpenAI's to rely on a proven media infrastructure. Today, Justin and Sean are colleagues at OpenAI, working to bridge WebRTC and real-time AI.
Choosing a Media Architecture
After selecting WebRTC, OpenAI had to decide where to terminate the connection and how to link these sessions to the inference backend. Termination determines session state management, media transport, routing, latency, and failure isolation.
A Selective Forwarding Unit (SFU) is often used for multiparty products, but for 1:1 sessions, OpenAI opted for a transceiver model. This model allows for converting media and events into internal protocols for model inference, transcription, and speech generation.
Implementation with Pion
OpenAI built its transceiver service in Go, using Pion to handle signaling and media termination. This service powers voice in ChatGPT, the Realtime API, and several research projects. The transceiver manages SDP negotiation, codec selection, ICE identifiers, and session configuration.
Deployment Challenges with Kubernetes
Deploying on Kubernetes presents challenges, including port exhaustion. The conventional one-port-per-session WebRTC termination model requires large ranges of public UDP ports, which are difficult to manage and secure.
OpenAI has separated packet routing from protocol termination, using a relay and transceiver architecture. This approach allows for maintaining a small public UDP surface while routing each packet to the transceiver that owns the corresponding WebRTC session.
ICE and DTLS protocols are stateful, meaning that the process that created a session must continue to receive packets from that session to validate connectivity checks, complete the DTLS exchange, decrypt SRTP, and handle subsequent session changes such as ICE restarts.
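The traffic types named above (ICE connectivity checks carried as STUN, the DTLS exchange, and SRTP media) all share one UDP flow, so whichever process owns the session has to tell them apart first. A minimal sketch of that step, using the first-byte ranges defined in RFC 7983; the type and function names are hypothetical and this is not presented as OpenAI's code:

```go
package main

import "fmt"

// packetKind is a hypothetical label for the protocols sharing one UDP flow.
type packetKind int

const (
	kindUnknown packetKind = iota
	kindSTUN               // ICE connectivity checks
	kindDTLS               // DTLS handshake and records
	kindRTP                // SRTP or SRTCP media and feedback
)

func (k packetKind) String() string {
	return [...]string{"unknown", "stun", "dtls", "rtp/rtcp"}[k]
}

// classify inspects the first byte of a datagram, following the
// demultiplexing ranges in RFC 7983.
func classify(pkt []byte) packetKind {
	if len(pkt) == 0 {
		return kindUnknown
	}
	switch b := pkt[0]; {
	case b <= 3:
		return kindSTUN
	case b >= 20 && b <= 63:
		return kindDTLS
	case b >= 128 && b <= 191:
		return kindRTP
	default:
		return kindUnknown
	}
}

func main() {
	fmt.Println(classify([]byte{0x00, 0x01})) // STUN binding request type
	fmt.Println(classify([]byte{0x16, 0xfe})) // DTLS handshake record
	fmt.Println(classify([]byte{0x80, 0x6f})) // RTP version 2 header
}
```

Only the STUN packets carry ICE credentials; DTLS and SRTP packets are meaningful only to the process that already holds the session's keys, which is why ownership has to stay put.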
Comparing WebRTC Media Architectures
OpenAI evaluated several WebRTC media architectures, including TURN (Traversal Using Relays around NAT) relays and a single IP:port per server. Each approach has its advantages and disadvantages in terms of port management, security, and adaptability to Kubernetes.
OpenAI's relay and transceiver architecture separates packet routing from protocol termination. Signaling always reaches the transceiver for session configuration, while media first passes through the relay, a lightweight UDP forwarding layer.
Port Exhaustion
The first issue was the one-port-per-session model itself. At high concurrency, this means exposing and managing very large ranges of public UDP ports.
Cloud load balancers and Kubernetes services are not designed for tens of thousands of public UDP ports per service. Each additional range adds operational complexity in load balancer configuration, health checking, firewall policy, and deployment security.
Large ranges of UDP ports are difficult to secure as they increase the accessible attack surface and complicate network policy auditing.
They are also unsuitable for auto-scaling. Pods are constantly added, removed, and rescheduled in Kubernetes. Requiring each pod to reserve and advertise a large range of stable ports makes this elasticity fragile.
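To make the scale of the problem concrete, here is a back-of-the-envelope sketch of how the public port count grows under each model. The pod and session counts are hypothetical illustration values, not figures from OpenAI:

```go
package main

import "fmt"

// portsPerSessionModel returns the public UDP ports a fleet must expose
// when every live session gets its own port.
func portsPerSessionModel(pods, sessionsPerPod int) int {
	return pods * sessionsPerPod
}

// portsPerServerModel returns the public UDP ports needed when each pod
// shares one socket across all of its sessions.
func portsPerServerModel(pods int) int {
	return pods
}

func main() {
	// Assumed values for illustration only.
	pods, sessionsPerPod := 200, 500

	fmt.Println("one port per session:", portsPerSessionModel(pods, sessionsPerPod))
	fmt.Println("one port per server: ", portsPerServerModel(pods))
}
```

With these assumed numbers the per-session model needs 100,000 public ports, and that total changes every time the autoscaler adds or removes a pod.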
Stateful Co-location
Single-port-per-server designs solve the port count issue but introduce a second problem: preserving session ownership across a fleet. ICE and DTLS protocols are stateful. The process that created a session must continue to receive packets from that session to validate connectivity checks, complete the DTLS exchange, decrypt SRTP, and handle subsequent session changes such as ICE restarts. If packets for the same session arrive at a different process, session setup may fail or media may be interrupted.
This gave OpenAI a specific goal: expose a small fixed public UDP surface while routing each packet to the transceiver that owns the corresponding WebRTC session.
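The article does not say how a packet is matched to its owning transceiver. One plausible approach, assumed here purely for illustration, is to read the ICE username fragment from the first STUN connectivity check (the USERNAME attribute has the form `receiver-ufrag:sender-ufrag`), look up the owner, and then remember the client's address so that later DTLS and SRTP packets, which carry no ufrag, follow the same path. The `router` type, its maps, and all addresses below are hypothetical:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
)

const (
	stunHeaderLen   = 20
	stunMagicCookie = 0x2112A442
	attrUsername    = 0x0006
)

// stunUsername returns the USERNAME attribute of a STUN message, if present.
func stunUsername(pkt []byte) (string, bool) {
	if len(pkt) < stunHeaderLen || pkt[0]&0xC0 != 0 {
		return "", false
	}
	if binary.BigEndian.Uint32(pkt[4:8]) != stunMagicCookie {
		return "", false
	}
	end := stunHeaderLen + int(binary.BigEndian.Uint16(pkt[2:4]))
	if end > len(pkt) {
		return "", false
	}
	for off := stunHeaderLen; off+4 <= end; {
		typ := binary.BigEndian.Uint16(pkt[off : off+2])
		n := int(binary.BigEndian.Uint16(pkt[off+2 : off+4]))
		val := off + 4
		if val+n > end {
			return "", false
		}
		if typ == attrUsername {
			return string(pkt[val : val+n]), true
		}
		off = val + ((n + 3) &^ 3) // attribute values are padded to 4 bytes
	}
	return "", false
}

// router is a hypothetical routing table, not OpenAI's relay.
type router struct {
	byUfrag map[string]string // server-side ufrag -> transceiver address
	byPeer  map[string]string // client ip:port -> transceiver address
}

// route picks the transceiver for a packet arriving from peer.
func (r *router) route(peer string, pkt []byte) (string, bool) {
	if dst, ok := r.byPeer[peer]; ok {
		return dst, true // known flow: DTLS and SRTP packets land here
	}
	user, ok := stunUsername(pkt)
	if !ok {
		return "", false
	}
	serverUfrag, _, found := strings.Cut(user, ":")
	if !found {
		return "", false
	}
	dst, ok := r.byUfrag[serverUfrag]
	if ok {
		r.byPeer[peer] = dst
	}
	return dst, ok
}

// bindingRequest builds a minimal STUN binding request for the demo.
func bindingRequest(username string) []byte {
	padded := (len(username) + 3) &^ 3
	pkt := make([]byte, stunHeaderLen+4+padded)
	binary.BigEndian.PutUint16(pkt[0:2], 0x0001) // Binding request
	binary.BigEndian.PutUint16(pkt[2:4], uint16(4+padded))
	binary.BigEndian.PutUint32(pkt[4:8], stunMagicCookie)
	binary.BigEndian.PutUint16(pkt[20:22], attrUsername)
	binary.BigEndian.PutUint16(pkt[22:24], uint16(len(username)))
	copy(pkt[24:], username)
	return pkt
}

func main() {
	r := &router{
		byUfrag: map[string]string{"srvA": "10.0.0.7:4000"},
		byPeer:  map[string]string{},
	}
	fmt.Println(r.route("203.0.113.5:51000", bindingRequest("srvA:cli1")))
	fmt.Println(r.route("203.0.113.5:51000", []byte{0x16, 0xfe, 0xfd}))
}
```

The per-client map in this sketch is soft state that the STUN checks can rebuild at any time; how the production relay keeps itself stateless, as the comparison describes it, is not covered in the source.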
Comparing WebRTC Media Architectures
OpenAI evaluated several ways to achieve this, including TURN (Traversal Using Relays around NAT), where an edge relay terminates client allocations and relays traffic on their behalf.
Approach: advantages and disadvantages
One IP:port per session (also known as native direct UDP):
- Advantages: Direct client-server media path, with no forwarding layer in the data path.
- Disadvantages: Requires a public UDP port per session; large port ranges are difficult to expose and secure; fits poorly with Kubernetes and cloud load balancers.
Single IP:port per server:
- Advantages: Much smaller public UDP footprint than per-session exposure; a shared socket per server can demultiplex multiple sessions.
- Disadvantages: Works cleanly on a single host, but not across a load-balanced fleet.
TURN relay (terminating the protocol):
- Advantages: Clients only need to reach the TURN relay's address and port; policy can be centralized at the edge.
- Disadvantages: TURN allocations add configuration overhead; moving or recovering allocations between TURN servers remains challenging.
Stateless forwarder + stateful terminator (OpenAI relay + transceiver):
- Advantages: Small public UDP footprint; the transceiver always owns the complete WebRTC session.
- Disadvantages: Adds a forwarding hop before media reaches the owning transceiver; requires custom coordination between the relay and the transceiver.
Overview of the Architecture: Relay + Transceiver
The architecture OpenAI deployed separates packet routing from protocol termination. Signaling always reaches the transceiver for session configuration, while media first enters through the relay. The relay is a lightweight UDP forwarding layer with a small public footprint.
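The data path of such a relay can be sketched in a few lines of Go using only the standard `net` package. This is an illustrative loopback demo under stated assumptions, not OpenAI's relay: `forwardOnce`, `listenLoopback`, and `demo` are hypothetical names, the destination is fixed rather than chosen per packet, and the return path from transceiver to client is omitted.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// forwardOnce reads one datagram from the relay's public socket and sends
// the payload, unmodified, to the given transceiver address.
func forwardOnce(public *net.UDPConn, transceiver *net.UDPAddr) error {
	buf := make([]byte, 1500) // assumed bound: typical Ethernet MTU
	n, _, err := public.ReadFromUDP(buf)
	if err != nil {
		return err
	}
	_, err = public.WriteToUDP(buf[:n], transceiver)
	return err
}

// listenLoopback opens a UDP socket on an ephemeral loopback port with a
// short deadline so the demo cannot hang.
func listenLoopback() (*net.UDPConn, error) {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return nil, err
	}
	if err := conn.SetDeadline(time.Now().Add(2 * time.Second)); err != nil {
		conn.Close()
		return nil, err
	}
	return conn, nil
}

// demo sends one datagram client -> relay -> transceiver and returns what
// the transceiver received.
func demo(payload string) (string, error) {
	transceiver, err := listenLoopback()
	if err != nil {
		return "", err
	}
	defer transceiver.Close()

	relay, err := listenLoopback()
	if err != nil {
		return "", err
	}
	defer relay.Close()

	client, err := net.DialUDP("udp", nil, relay.LocalAddr().(*net.UDPAddr))
	if err != nil {
		return "", err
	}
	defer client.Close()
	if _, err := client.Write([]byte(payload)); err != nil {
		return "", err
	}

	if err := forwardOnce(relay, transceiver.LocalAddr().(*net.UDPAddr)); err != nil {
		return "", err
	}

	buf := make([]byte, 1500)
	n, _, err := transceiver.ReadFromUDP(buf)
	if err != nil {
		return "", err
	}
	return string(buf[:n]), nil
}

func main() {
	got, err := demo("media-packet")
	fmt.Println(got, err)
}
```

Because the relay never parses or decrypts the payload, the DTLS and SRTP state stays entirely in the transceiver. Note that in this plain forwarding sketch the transceiver sees the relay's address as the packet source; how the real system conveys the client's address is not described in the source.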