OpenAI Blog→ original

OpenAI introduced MRC — a network protocol for 100,000-GPU AI training clusters

Through the Open Compute Project, OpenAI released the MRC specification, a new network protocol for training large models. It splits the traffic of a single…

AI-processed from OpenAI Blog; edited by Hamidun News
OpenAI introduced MRC — a network protocol for 100,000-GPU AI training clusters
Source: OpenAI Blog. Collage: Hamidun News.
◐ Listen to article

OpenAI has unveiled MRC — a new network protocol for supercomputers used to train large AI models. According to the company, it is already deployed across all of its largest clusters on NVIDIA GB200, including the OCI facility in Abilene and Microsoft Fairwater systems, and helps maintain performance even when network links and switches fail.

Why MRC Was Needed

Training frontier models relies not only on GPUs themselves but also on the network between them. At each training step, millions of data exchanges occur, and if even a single packet or stream arrives noticeably later than the others, some accelerators begin to sit idle. On smaller clusters, such delays can still be tolerated, but on systems the scale of Stargate, the problem becomes systemic: the more nodes involved, the higher the chance of congestion, latency jitter, and hardware failures.

For OpenAI, this is no longer a secondary engineering challenge. The company notes that ChatGPT is used by over 900 million people weekly, which means computational infrastructure is becoming a foundational service layer. This is why the team, working with AMD, Broadcom, Intel, Microsoft, and NVIDIA over the past two years, has rebuilt the network stack to deliver not just high speed but predictable behavior under load and during partial failures.

How the Network Works

The key idea behind MRC is to not treat the network interface as a single large pipe at 800 Gbps. Instead, OpenAI divides it into several smaller channels: for example, eight 100 Gbps lines, each going to its own switch. This creates a multi-plane network where the same traffic can be routed through many independent paths.

In such a configuration, according to OpenAI's estimates, a network of roughly 131,000 GPUs can be built with just two levels of Ethernet switches, whereas a traditional 800 Gbps design would require three or four levels. The protocol itself then comes into play, extending the familiar RoCE stack for AI training tasks. Instead of sending all traffic along a single route, MRC "scatters" packets of a single transfer across hundreds of paths at once.

Packets may arrive out of order, but this is acceptable because each packet already specifies its final memory address, and the receiver assembles the data in place as it arrives. This allows the network to use available channels more evenly and handles local congestion much better.

  • A single exchange is split into multiple parallel paths across different network planes
  • When signs of congestion are detected, the protocol removes the problematic path and replaces it with another
  • When a packet is lost, MRC quickly assumes failure and retransmits the data
  • If a packet is lost due to receiver-side congestion, packet trimming helps—sending only the header to explicitly request retransmission

OpenAI specifically emphasizes that MRC can bypass network failures at microsecond scales, whereas traditional fabrics might require seconds or even tens of seconds to reconfigure routes. This is especially critical for synchronous training, where the entire computation step is determined not by the average but by the slowest transfer in the cluster. With this balance, multiple tasks can share a single cluster with less risk of interfering with each other.

What Changes in Operations

Another important shift is moving away from conventional dynamic routing in favor of source routing based on SRv6. In a typical network, switches themselves recalculate routes through protocols like BGP, which adds complexity and introduces new failure modes. In MRC, the sender encodes the packet path directly in the IPv6 address, and switches simply execute this route sequentially using static tables.

The idea seems radical, but according to OpenAI, it simplifies the control plane and eliminates the need to constantly manually fix network logic. For OpenAI, practice matters more than theory, and here the company has concrete numbers. It reports that its training networks consist of millions of links, and in real deployments, multiple brief outages can occur between tier-0 and tier-1 switches every minute—with no measurable impact on synchronous pretraining.

During training of one of its recent frontier models for ChatGPT and Codex, engineers had to reboot four tier-1 switches, and this required no coordination with the teams that were conducting training at the time. If an eight-port network interface loses one port, throughput decreases by at most one-eighth, but the task itself continues to run rather than failing completely.

What This Means

MRC demonstrates that the race for stronger models is increasingly shifting to infrastructure. OpenAI is not just accelerating the training of its clusters but is also contributing the protocol to the Open Compute Project, attempting to turn its own engineering solution into an industry standard. If the approach is adopted by other labs and cloud providers, large AI clusters will become cheaper, simpler to operate, and more resilient to failures without constant manual network tuning.

ZK
Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.

Need AI working inside your business — not just in your newsfeed?

I build production AI for companies — custom CRM, internal tools, autonomous agents, workflow automation. Owned by you, shaped to your process, no per-seat tax. Built by Zhemal Khamidun, CPO of AlpinaGPT (AI platform, 6,000+ users).

What do you think?
Loading comments…