OpenAI and MRC: Ultra-Fast Networks for AI

⚡

Key Takeaways

1OpenAI has collaborated with AMD, Broadcom, Intel, Microsoft, and NVIDIA to create the MRC protocol, enhancing GPU network performance.

2The MRC enables the construction of high-speed multi-plane networks, reducing congestion and increasing resilience to failures.

3By distributing packets across multiple paths, MRC minimizes the impact of network failures, accelerating the training of AI models.

💡Why it matters — MRC optimizes the efficiency of supercomputers, which is crucial for supporting the rapid growth of large-scale AI models.

A Collaboration for Faster Networks

To meet the growing demands of AI model training, OpenAI has partnered with technology giants such as AMD, Broadcom, Intel, Microsoft, and NVIDIA. Together, they have designed the MRC (Multipath Reliable Connection), an innovative protocol aimed at enhancing the performance and resilience of GPU networks in massive training clusters. This protocol has been made public through the Open Compute Project (OCP), allowing the entire industry to benefit from it.

With over 900 million people using ChatGPT each week, OpenAI's systems have become essential infrastructure for AI, supporting the creation of increasingly powerful models. Before the advent of Stargate, OpenAI developed and maintained three generations of supercomputers in close collaboration with its partners, which underscored the need to simplify and optimize every level of the architecture, including network design.

The Goals of MRC

The release of the MRC specification is part of OpenAI's broader strategy to establish shared standards for key infrastructures. These standards facilitate the evolution of AI systems in a more efficient and reliable manner while integrating a wide range of partners. This article explores the design of MRC, highlighting its capabilities to:

Build high-speed multipath networks to create redundancy against network failures while using fewer components and less energy.
Virtually eliminate central congestion through adaptive packet spraying.
Utilize static source routing to circumvent failures and eliminate certain routing failures.

These innovations enable OpenAI to deliver improved models more quickly.

The Need for a New Network Design

Training large AI models involves millions of data transfers at each step. A single delayed transfer can disrupt the entire process, rendering GPUs idle. Network congestion and link or device failures are the primary causes of these delays.

As clusters grow larger, these issues become more frequent and complex, making network technology crucial for the design of Stargate. To support the current scale of the Stargate supercomputers, two major challenges had to be addressed:

Reduce the likelihood of network congestion.
Minimize the impact of network failures on training.

The Response: the MRC Protocol

OpenAI's goal was not only to build a fast network but also to ensure predictable performance even in the event of failures, allowing training to continue uninterrupted. To achieve this reliability, OpenAI collaborated with AMD, Broadcom, Intel, Microsoft, and NVIDIA for two years to develop a new method for building and operating networks. The result of this effort is the Multipath Reliable Connection, or MRC.

MRC is a network protocol integrated into the latest 800 Gb/s network interfaces, allowing a single transfer to be distributed across hundreds of paths, circumventing failures in microseconds and simplifying network control plans. It extends RDMA over Converged Ethernet (RoCE), a standard from the InfiniBand Trade Association (IBTA) that facilitates direct remote memory access between GPUs and CPUs.

The Foundation: Multipath Networks

Building resilient networks requires a topology with sufficient redundancy so that all flows can operate smoothly, even in the event of link or switch failures.

Instead of treating each network interface as a single 800 Gb/s link, it is divided into several smaller links. For example, an interface can connect to eight different switches, thereby creating eight parallel networks, each operating at 100 Gb/s, rather than a single 800 Gb/s network.

The Benefits of MRC

Reduced costs and energy consumption.
Improved path diversity.
Ability to connect over 131,000 GPUs with just two levels of switches.

However, fully leveraging this path diversity can be complex. Traditional network protocols for AI training typically require each transfer to follow a single path for packets to arrive in order. In a large multipath network, this can lead to congestion issues.

The Transformation Brought by MRC: Packet Distribution

MRC fundamentally alters this model. Instead of assigning a transfer to a single path, MRC distributes the packets of a single transfer across hundreds of paths throughout the network. Packets may arrive out of order, but each MRC packet includes its final memory address, allowing the destination to deliver them to memory as soon as they arrive.

This combination of multipath topology, distribution, load balancing, and reduction enables an MRC connection to detect and circumvent network failures in microseconds, thus minimizing the impact on synchronous training jobs. In comparison, a conventional network might require several seconds, or even tens of seconds, to stabilize and bypass failures.