Turn off the routing protocols. Accept packet loss on purpose. Spray every transfer across hundreds of random paths. If somebody handed you this list of design choices for a network connecting 131,000 GPUs, you would assume it was written by someone who had never operated a production network.
A consortium of OpenAI, AMD, Broadcom, Intel, Microsoft, and NVIDIA built exactly this, and quietly inverted three decades of consensus about how high-performance data center networks should work.
The protocol is called MRC, short for Multipath Reliable Connection. It was released on May 5, 2026 through the Open Compute Project (OCP). The accompanying research paper (Araujo et al., 2026) details its deployment across OpenAI's largest NVIDIA GB200 supercomputers, including the Stargate site with Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft's Fairwater supercomputers. MRC has been used to train the latest frontier models behind ChatGPT and Codex.
What is most striking on a close reading of the paper is something the press coverage has not surfaced: MRC effectively eliminates the entire Layer 3 control plane from the data center fabric. No OSPF. No BGP. No IS-IS. No FIB. The switches in the deployment maintain zero dynamic forwarding state. To the author's knowledge, this is the most aggressive elimination of dynamic routing in any production AI training fabric publicly documented to date.
The paper's core argument is that at 100,000+ GPU scale, tail latency from network congestion and failures dominates training performance, and the conventional networking stack cannot solve this without fundamental changes to how packets move between GPUs. MRC is those fundamental changes, implemented in 800 Gb/s NICs from three different silicon vendors and deployed in production.
What makes MRC worth studying carefully is not that it is fast. It is that the design choices behind it contradict several principles the networking community has treated as settled for decades. Understanding why these choices work at this scale, and where they might not, matters for anyone building or operating AI infrastructure.
Left: conventional RoCE with single-path routing. A congested T1 link triggers a PFC PAUSE that propagates backward, blocking GPU 2 even though its own path was clear. All 100,000 GPUs idle until GPU 2's transfer completes. Right: MRC sprays packets across 8 independent planes. When a link fails in Plane 2, the NIC retires that entropy value and redistributes traffic to the remaining 7 planes in microseconds. No GPU ever stalls. The 5 numbered design choices at the bottom are the subject of this article.
[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]
Every of MRC’s selections is individually acquainted to anybody who has adopted networking analysis. The mix is what’s radical. The networking group has explored each certainly one of these concepts in isolation — multi-plane materials, supply routing, packet spraying, lossy transports with selective retransmission, ECN as a load-balancing sign. What makes MRC price cautious research is that the OpenAI consortium dedicated to all of them, concurrently, in manufacturing at 131,000 GPUs.
The issue: one straggler blocks 100,000 GPUs
Synchronous pretraining runs in lock-step. Each coaching step entails thousands and thousands of knowledge transfers throughout 1000’s of GPUs performing a mix of tensor parallelism, pipeline parallelism, information parallelism, and professional parallelism. The step can not advance till the slowest switch completes. At 100,000 GPUs, the period of every communication spherical is decided by the tail of the switch latency distribution, not the imply.
The paper frames this exactly: “As computations scale, communication turns into more and more outlier-dominated.” A single congested hyperlink, a single circulate collision, a single swap buffer overflow can stall 1000’s of GPUs for milliseconds. On the hourly price of 100,000 H100-class GPUs (roughly $300,000 per hour at cloud charges), a 10-millisecond stall that happens as soon as per coaching step and repeats throughout 1000’s of steps shouldn’t be a rounding error. It’s a line merchandise.
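A back-of-envelope calculation makes the line item concrete; a minimal sketch, where only the hourly rate comes from the paper and the step count is an assumed round number:

    # Back-of-envelope: recurring 10 ms stalls at the paper's quoted fleet rate.
    fleet_cost_per_second = 300_000 / 3600    # ~$83 of GPU time per second
    stall_seconds = 0.010                     # one 10 ms stall per training step
    steps = 50_000                            # assumed run length, not from the paper
    print(f"${fleet_cost_per_second * stall_seconds * steps:,.0f} of idle fleet time")
    # -> $41,667 of idle fleet time, from a stall too brief for a human to notice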
Network failures compound the problem. At this scale, link flaps, optic failures, and switch reboots are not rare events. They are statistical certainties that occur multiple times per day across a fabric with hundreds of thousands of links. The paper reports a production incident in which an optical transceiver on a T0 switch “suffered a glitch, and flapped all its four links in rapid succession,” affecting three active training nodes simultaneously. In a conventional network, this could have crashed the training job.
MRC's design goal was not just higher bandwidth. It was predictable bandwidth, even in the presence of failures, with a control plane simple enough that a small team can manage several supercomputers at once.
The topology: 131,000 GPUs in two switch tiers
The first design decision is architectural, not protocol-level. Instead of treating an 800 Gb/s NIC as one fat pipe, MRC splits it into eight 100 Gb/s links, each connecting to a different switch. This creates eight parallel network planes, each operating independently.
Consider the conventional approach. Today's fastest data center Ethernet switches offer 51.2 Tb/s of switching capacity, yielding 64 ports at 800 Gb/s. In a typical fat-tree Clos topology, each Tier-0 (T0) switch connects down to 32 NICs and up to 32 Tier-1 (T1) switches. Each T1 switch connects to 64 pods. That gives you a 3-tier network supporting roughly 64,000 GPUs at full bisection bandwidth. To reach 100,000, you need a fourth tier, which adds latency, cost, and failure domains.
Now split the NIC. The same 51.2 Tb/s switch at 100 Gb/s per port gives you 512 ports instead of 64. Each T0 switch connects down to 256 NIC ports and up to 256 T1 switches. Each T1 connects to 512 T0s. A single two-tier plane supports 131,072 GPUs at full bisection bandwidth.
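The capacity arithmetic is the standard folded-Clos formula; a quick sketch reproduces both numbers (nothing here is MRC-specific):

    def fat_tree_hosts(radix, tiers):
        # Endpoints at full bisection bandwidth: radix**tiers / 2**(tiers - 1)
        return radix ** tiers // 2 ** (tiers - 1)

    # The same 51.2 Tb/s switch ASIC, carved two ways:
    print(fat_tree_hosts(64, 3))    # 64 ports at 800 Gb/s, 3 tiers  -> 65536
    print(fat_tree_hosts(512, 2))   # 512 ports at 100 Gb/s, 2 tiers -> 131072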
The paper quantifies the savings:
Conventional 3-tier (800 Gb/s):
- 3 switch tiers, 64-port switches
- Max ~64K GPUs at full bisection BW
- 5-hop or 7-hop worst-case path
Multi-plane 2-tier (8 × 100 Gb/s):
- 2 switch tiers, 512-port switches
- 131K GPUs at full bisection BW
- 3-hop worst-case path
- 2/3 the optics of a 3-tier network
- 3/5 the number of switches

[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]
The resilience math is equally compelling. Losing a single NIC-to-T0 link in the 800 Gb/s single-plane network removes 3% of that T0's downstream bandwidth (one of 32 links); in the 100 Gb/s multi-plane network, the same failure removes 0.4% (one of 256). More importantly, with eight independent planes, the NIC can keep operating on the remaining seven while the failed link is repaired. The training job does not have to stop.
This tradeoff is not free. Eight separate planes mean eight times as many links to monitor, eight times as many potential failure points in aggregate, and a transport protocol that must load-balance intelligently across all of them. That is where MRC itself comes in.
Packet spraying with entropy values
Conventional RDMA transports (RoCEv2, InfiniBand RC) pin each connection to a single network path. The path is chosen by hashing the flow's five-tuple (source/destination IP, source/destination port, protocol) at each switch. Once pinned, every packet in that connection follows the same path until the connection is torn down.
This works at moderate scale. It fails at 100,000+ GPUs because of flow collisions. When two connections hash to the same path through the same bottleneck link, both suffer. The probability of collision increases with scale, and the tail latency impact is disproportionate.
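The collision odds follow the birthday problem; a quick sketch with illustrative numbers (the flow and path counts are assumptions, not figures from the paper):

    import math

    def p_any_collision(flows, paths):
        # Birthday approximation: probability that at least two of `flows`
        # independent ECMP hash choices land on the same path.
        return 1 - math.exp(-flows * (flows - 1) / (2 * paths))

    print(round(p_any_collision(64, 1024), 2))   # 0.86: collisions are the norm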
MRC eliminates flow pinning entirely. Instead, it assigns each Queue Pair (QP) a set of 128 to 256 entropy values (EVs) at connection setup. Each EV encodes a specific path through a specific network plane. The sender rotates through its EV set packet by packet, spraying consecutive packets across hundreds of different paths across all eight planes. No two consecutive packets from the same transfer take the same route.
The EV is a 32-bit value split across the UDP source port and the IPv6 flow label in each MRC packet. Switches hash on these fields, so changing the EV changes the path. The sender does not need to know the topology. It only needs to know that different EVs produce different paths.
The per-QP state and sending loop, as a runnable sketch (the EV layout and health states are the paper's; the helper structure is illustrative):

    import random

    def make_qp(n_evs=128):
        # Per-QP state: 128-256 EVs (32 bits each); health is one of
        # active / congested / suspected_failed / confirmed_failed.
        evs = [random.getrandbits(32) for _ in range(n_evs)]
        return {"evs": evs, "health": dict.fromkeys(evs, "active"), "cursor": 0}

    def next_active_ev(qp):
        active = [ev for ev in qp["evs"] if qp["health"][ev] == "active"]
        ev = active[qp["cursor"] % len(active)]   # round-robin over healthy EVs
        qp["cursor"] += 1
        return ev

    def send_packet(qp, packet, send):
        ev = next_active_ev(qp)
        packet.udp_src_port = ev & 0xFFFF             # EV bits 0-15
        packet.ipv6_flow_label = (ev >> 16) & 0xFFFF  # EV bits 16-31
        send(packet)
Each EV carries a few bits of health state. When the receiver detects congestion on a path (via ECN marking from switches), it echoes this back to the sender, which temporarily avoids that EV. When a packet is actually lost (not trimmed), MRC assumes the path has failed and immediately stops using that EV. Background probes periodically test retired EVs to determine whether the failure was transient, resurrecting them if the probes succeed.
The load-balancing quality of this scheme is high. Because different senders independently generate random EV sets, the aggregate traffic distribution across paths is near-uniform. Small imbalances are smoothed by the ECN feedback loop: if one path accumulates slightly more traffic, ECN marks increase on that path, and senders redistribute to less-loaded alternatives.
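Continuing the sketch above, the per-EV health transitions reduce to a few event handlers (the handler names are mine; the four states are the paper's):

    def on_ecn_echo(qp, ev):
        # Receiver echoed an ECN mark for this EV: the path is busy, not broken.
        qp["health"][ev] = "congested"            # avoid it briefly, keep full rate

    def on_silent_loss(qp, ev):
        # The packet vanished with no trimmed header: treat the path as down.
        qp["health"][ev] = "suspected_failed"

    def on_probe_result(qp, ev, ok):
        # Background probes retest retired EVs and resurrect transient failures.
        qp["health"][ev] = "active" if ok else "confirmed_failed"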

[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]

[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]
Static source routing with SRv6
This is the most counterintuitive decision in the paper. Every production data center network runs dynamic routing protocols (BGP, OSPF, IS-IS) that compute forwarding tables, react to topology changes, and converge after failures. MRC disables all of them.
Instead, MRC uses IPv6 Segment Routing (SRv6) to encode the full path each packet should take. The sender embeds the sequence of switch identifiers directly into the packet's destination address. Each switch along the path checks that its identifier is present, removes it by shifting the address, and forwards to the next hop. No routing table lookup. No forwarding information base. No control plane convergence.
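A toy model of the shift-and-forward mechanism (an illustration only; the spec's actual segment format will differ, and 16-bit switch IDs are an assumption):

    def encode_path(switch_ids):
        # Pack the hop list into one integer, first hop in the lowest 16 bits.
        addr = 0
        for sid in reversed(switch_ids):
            addr = (addr << 16) | sid
        return addr

    def switch_forward(my_id, addr):
        assert (addr & 0xFFFF) == my_id    # this switch's ID must be at the front
        return addr >> 16                  # shift it off; the next hop is now first

    addr = encode_path([0x00A1, 0x0B02, 0x00C3])   # T0 -> T1 -> T0
    addr = switch_forward(0x00A1, addr)            # each hop pops itself and forwards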
The paper explains the logic: “We took the unusual position of disabling dynamic routing in the switches because we didn't want two adaptive routing mechanisms interacting with each other and dynamic routing wasn't adding anything.”
MRC's transport-layer adaptation (EV management, ECN feedback, path probing) already handles failures at microsecond timescales. Dynamic routing protocols converge in seconds to minutes. Running both creates a risk of conflicting decisions: MRC avoids a failed path at the transport layer while the routing protocol is still converging to a new forwarding state, potentially creating routing loops or oscillations.
By removing dynamic routing entirely, MRC gets three operational benefits:
First, deterministic forwarding. Every packet follows a known, pre-computed path. If something goes wrong, you can trace exactly which switches the packet traversed. The paper notes that this “gives us perfect observability” because the path is encoded in the packet itself.
Second, eliminated convergence failures. Dynamic routing protocols can misconfigure, loop, or partition the network during convergence. With static SRv6 routes, these failure modes do not exist. The switches are stateless packet forwarders.
Third, simplified operations. The paper emphasizes that “very small teams of people need to be able to manage the networks of multiple supercomputers.” Removing routing protocols removes an entire class of operational complexity, configuration drift, and debugging surface area.
The tradeoff is that path computation moves to the NIC. The MRC NIC must know enough about the topology to generate valid SRv6 paths for its EV set. In OpenAI's deployment, this is handled at QP setup time using a simple topology database. The paths are static and pre-computed. Runtime adaptation happens at the EV selection stage, not at the routing stage.
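At QP setup, this amounts to precomputing one static path per entropy value. A sketch assuming a hypothetical topology-database interface (num_planes, t1_switches, and t0_of are invented accessors, not the paper's API):

    import random

    def build_ev_table(topo, src_gpu, dst_gpu, n_evs=128):
        table = {}
        for ev in random.sample(range(1 << 32), n_evs):   # random 32-bit EVs
            plane = ev % topo.num_planes                  # pin the EV to one plane
            t1s = topo.t1_switches(plane)
            hops = [topo.t0_of(src_gpu, plane),           # up through the source T0,
                    t1s[ev % len(t1s)],                   # across one T1,
                    topo.t0_of(dst_gpu, plane)]           # down through the dest T0
            table[ev] = encode_path(hops)                 # encode_path from above
        return table                                      # EV -> static source route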

[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]
Running lossy: why MRC disables PFC
This is the decision that will surprise most networking practitioners. RDMA networks have traditionally relied on Priority Flow Control (PFC) to create lossless Ethernet fabrics. When a switch buffer fills, PFC sends a pause frame upstream, stopping the sender from transmitting until the buffer drains. InfiniBand has a similar credit-based flow control mechanism. The entire “lossless fabric” paradigm exists to support RDMA's assumption that packets do not get dropped.
MRC explicitly disables PFC and runs on standard best-effort (lossy) Ethernet.
The reason is head-of-line blocking. When a PFC pause frame fires on one port, it can block traffic destined for other ports that share the same ingress buffer. In a large training cluster running multiple collectives concurrently, a PFC pause triggered by one collective's incast can delay transfers from a completely unrelated collective. This cross-collective interference creates exactly the tail latency outliers that MRC is designed to eliminate.
The paper's solution is a combination of three mechanisms:
First, selective retransmission. MRC tracks which packets have been received using Selective ACKs (SACKs). When loss is detected, only the missing packets are retransmitted, not the entire window. This is faster than the go-back-N retransmission used in some RoCE implementations.
Second, packet trimming. When a switch would drop a packet due to buffer overflow, it instead trims the payload and forwards just the header as a priority packet. The receiver gets the trimmed header, learns about the gap, and sends a NACK to trigger immediate retransmission. This eliminates the timeout delay between loss detection and retransmission. It also lets MRC distinguish between congestion drops (trimmed packets) and link failures (no packet at all), enabling different recovery strategies for each.
Third, out-of-order memory placement. Every MRC data packet carries the RDMA virtual address and remote key. The receiving NIC can write each packet directly to its final memory location regardless of arrival order. This is critical because packet spraying across hundreds of paths guarantees that packets will arrive out of order. Without direct placement, the receiver would need reorder buffers, adding latency and memory overhead.
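From the receiving NIC's perspective, the three mechanisms meet in a single dispatch routine. A sketch with assumed names (psn, ev, rdma_vaddr, rkey, and the helper methods are illustrative, not the spec's):

    def on_receive(rx, pkt):
        if pkt.trimmed:
            # The switch kept the header but dropped the payload: a congestion
            # drop, not a link failure, so ask for that one packet again now.
            rx.send_nack(pkt.psn, pkt.ev)
            return
        # Direct placement: the packet names its own destination memory, so it
        # lands in its final location no matter how far out of order it arrived.
        rx.write_memory(pkt.rdma_vaddr, pkt.rkey, pkt.payload)
        rx.sack.mark(pkt.psn)    # received-set feeds the SACKs the sender uses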

[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]
ECN repurposed: load balancing, not congestion control
In conventional networks, Explicit Congestion Notification (ECN) signals congestion to the sender, which responds by reducing its transmission rate (similar to TCP congestion control). MRC repurposes ECN entirely.
In MRC's multi-plane topology with full bisection bandwidth, aggregate congestion should not exist under normal operation. The total available bandwidth exceeds the total demand. What does exist is local path imbalance: some paths may be slightly more loaded than others because of random EV selection across different senders.
MRC uses ECN as a per-path load signal. Switches mark packets with ECN in the standard randomized fashion, but MRC disables ECN marking on the last hop to the receiver (to avoid conflating last-hop incast with fabric congestion). The receiver echoes ECN marks back to the sender, tagged with the specific EV that was marked. The sender then temporarily avoids that EV, shifting traffic to less-loaded paths.
This transforms ECN from a rate-control mechanism into a routing-level load-balancing signal. The sender does not slow down. It redirects. The distinction matters because reducing rate wastes GPU time (the transfer takes longer), while redirecting maintains throughput while smoothing out imbalances.
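The contrast fits in two handlers (DCQCN-style rate halving stands in for the conventional response; the avoidance window is an assumed constant):

    AVOID_WINDOW = 100e-6    # assumed: sit the marked EV out for ~100 microseconds

    def on_ecn_conventional(flow):
        flow.rate *= 0.5                  # rate control: the whole flow slows down

    def on_ecn_mrc(qp, ev, now):
        # Load balancing: keep the rate, move the traffic off the marked path.
        qp["health"][ev] = "congested"
        qp.setdefault("avoid_until", {})[ev] = now + AVOID_WINDOW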

[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]
What the production evidence shows
The paper reports results from two contexts: production frontier model training and controlled testbed experiments.
In production, MRC allowed training jobs to ride out network failures that previously would have crashed the job. The paper describes the optical transceiver glitch mentioned earlier: four links flapped in rapid succession on three active training nodes. MRC detected the path failures, stopped using the affected EVs, and redistributed traffic across the remaining paths. The training job continued without interruption. In a conventional RoCE deployment, this event would have triggered PFC storms, NCCL timeouts, and a job restart costing hours of GPU time.
The testbed experiments quantify MRC's performance characteristics:
Point-to-point bandwidth: MRC achieves near-line-rate throughput on 800 Gb/s links with packet spraying. The paper reports a comparison with standard RoCE showing MRC's advantage under multi-path conditions.
Link failure recovery: when a link goes down, MRC detects it and redistributes traffic in tens of microseconds. No sender-side timeouts. No routing protocol convergence. The EV that mapped to the failed path is retired immediately, and the remaining EVs absorb the traffic.
Load balancing across EVs: the paper measures traffic distribution across planes and paths, showing near-uniform utilization under production workloads.
NCCL collective performance at scale: the paper evaluates MRC's performance on all-reduce operations, which are the dominant communication pattern in data-parallel training. MRC's packet spraying eliminates the flow-collision problem that degrades all-reduce performance at scale with conventional ECMP hashing.
The operational evidence supports the static routing decision. The paper reports that T1 core switches were rebooted during active training runs without disrupting the job. In a conventional network with dynamic routing, rebooting a core switch triggers reconvergence across the fabric. With static SRv6, the switch simply reloads its static forwarding state and resumes. MRC's transport layer handled the temporary loss of paths through that switch by redistributing traffic to other planes.

[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]
Where these design choices are strongest
MRC was designed for a specific workload profile: synchronous pretraining with all-reduce-dominated communication, running on a single-tenant fabric with full bisection bandwidth. Within those constraints, the three design choices are well-matched to the problem:
Static routing works because the topology is fixed and known at deployment time. Training clusters do not add or remove switches during a training run. The failure modes are link-level (handled by MRC's EV management), not topology-level (which would require routing protocol reconvergence).
Lossy Ethernet works because the selective retransmission and packet trimming mechanisms recover faster than PFC pause frames propagate. The cross-collective head-of-line blocking that PFC creates is more damaging to tail latency than the occasional retransmission overhead.
ECN-as-load-balancing works because the multi-plane topology provides full bisection bandwidth, ensuring that aggregate congestion does not occur. Local imbalances are the only congestion source, and ECN-guided EV avoidance is a precise, low-overhead mechanism for smoothing them.

[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]
The boundary conditions: where MRC works and where it doesn't
MRC is a production-proven protocol for its target workload. The natural questions for the broader AI infrastructure community concern the boundary conditions.
First, multi-tenancy. OpenAI's training clusters run a single training job at a time across the full fabric. Most cloud providers and enterprise deployments share GPU clusters across multiple workloads. MRC's static routing assumes a stable topology database at the NIC level. In a multi-tenant environment where workloads are dynamically placed, the topology visible to each NIC changes frequently. Whether MRC's path-generation logic adapts to this or requires modifications is an open engineering question.
Second, inference workloads. MRC was designed for synchronous training's all-reduce communication pattern: large bulk transfers between known sets of GPUs. Inference workloads, particularly disaggregated inference with KV cache transfers between prefill and decode pools, have a different communication profile: smaller transfers, point-to-point rather than collective, and latency-sensitive at the individual request level rather than the aggregate step level. Packet spraying across hundreds of paths adds jitter to individual transfer latency, which may or may not matter depending on the SLO requirements.
Third, oversubscribed networks. MRC's ECN-as-load-balancing mechanism relies on full bisection bandwidth. In oversubscribed networks (common in cloud environments where cost optimization drives topology choices), aggregate congestion is real, not just local imbalance. ECN would need to function as a true congestion signal in that case, which changes MRC's flow control dynamics.
Fourth, interoperability. MRC is currently implemented in specific NIC silicon (NVIDIA ConnectX-8, AMD Pollara/Vulcano, Broadcom Thor Ultra) and specific switch platforms (NVIDIA Spectrum-4/5, Arista EOS on Broadcom Tomahawk 5). The OCP release of the specification enables broader implementation, but silicon-level protocol support takes 12-18 months to develop and validate. Near-term adoption will likely be limited to organizations using these specific hardware platforms.
These are not criticisms of MRC. They are the engineering questions that arise naturally when a protocol designed for a specific, well-defined environment meets the diversity of the broader infrastructure market. The fact that MRC solved the tail latency problem at 131,000-GPU scale is a genuine achievement. The question for the rest of the community is which of its design choices generalize and which are specific to the constraints of single-tenant, full-bisection-bandwidth training fabrics.
What MRC signals about the future of AI networking
MRC represents a broader shift in how AI infrastructure thinks about networking. The conventional approach treats the network as a transparent pipe: packets go in one end and come out the other, and the transport protocol's job is to fill the pipe as efficiently as possible. MRC treats the network as a managed resource with observable, per-path health signals that the transport protocol actively exploits.
This is not a new idea in networking research. Multipath TCP, Valiant load balancing, and ECMP have explored variations of it for years. What is new is the scale at which MRC operates, the aggressiveness of its design choices (no PFC, no dynamic routing, full packet spraying), and the production evidence that it works on the largest AI training clusters on the planet.
For networking practitioners, MRC validates a thesis that has been debated for a decade: at sufficient scale, endpoint intelligence beats network intelligence. Making the NIC smarter and the switch simpler produces a more resilient system than making the switch smarter and the NIC simpler. Whether you agree with every design decision or not, the production evidence from OpenAI and Microsoft makes this argument harder to dismiss than it was a week ago.
The MRC specification is available through OCP under an open license. The research paper provides detailed experimental results. For anyone building GPU clusters at scale, both are worth reading carefully. The three rules MRC breaks might be the same three rules holding your network back.

