OpenAI introduced MRC — a network protocol for 100,000-GPU AI training clusters
Through the Open Compute Project, OpenAI released the MRC specification, a new network protocol for training large models. It splits the traffic of a single com

◐ Слушать статью
Through the Open Compute Project, OpenAI released the MRC specification, a new network protocol for training large models. It splits the traffic of a single communication across hundreds of paths, routes around failures faster, and simplifies network architecture. According to the company, MRC is already running in the largest clusters with NVIDIA GB200 and can withstand link failures and even switch reboots without interrupting training.