UC Berkeley created mKernel: a unified library for GPU synchronization in clusters
The UCCL team at UC Berkeley has released mKernel, a CUDA library for GPU synchronization in large clusters. The library combines local communication between GPUs, cross-server synchronization, and computation in a single persistent kernel, avoiding repeated context switches.
Bottleneck in Large-Scale Clusters
In large data centers, synchronization between GPUs is one of the main performance bottlenecks. Engineers running distributed training of large models face a problem that has traditionally been solved inefficiently. The old approach was layered and relied on three separate tools and libraries.
The first handles fast communication within a single server (NVLink, a direct high-speed interconnect between GPUs). The second handles synchronization between servers over the network (RDMA, which relies on RDMA-capable network adapters). The third handles the computation itself.
Each transition between these three systems stalls the GPU pipeline: it has to stop, switch context, unload part of its memory, load new data, and only then continue. On clusters with thousands of GPUs, these microsecond delays accumulate into minutes of lost compute time.
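To see how microsecond stalls add up, a back-of-the-envelope sketch helps. The figures below (20 microseconds per transition, 3 transitions per training step, one million steps) are illustrative assumptions, not measurements from mKernel or any real cluster.

```python
def lost_seconds(stall_us: float, transitions_per_step: int, steps: int) -> float:
    """Total time one GPU spends stalled on transitions, in seconds."""
    return stall_us * 1e-6 * transitions_per_step * steps


# Hypothetical inputs, chosen only for illustration.
total = lost_seconds(stall_us=20, transitions_per_step=3, steps=1_000_000)
print(f"{total:.0f} s stalled per GPU (about {total / 60:.0f} min)")
```

Under these assumptions each GPU loses about a minute over the run, and every GPU in the cluster pays the same cost.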
How mKernel Solves the Problem
mKernel takes a different approach. Instead of three separate systems, all operations (local communication, network synchronization, and computation) run in a single persistent kernel: a kernel that stays resident on the GPU for the whole workload instead of being relaunched from the CPU for every step. The architecture combines three components:
- NVLink for communication between GPUs in one server: direct high-speed links between GPUs, with bandwidth on the order of 10x that of PCIe, depending on hardware generation
- RDMA through network adapters for synchronization between servers: data moves between machines without CPU involvement or operating system overhead on the data path
- Dense computations embedded directly in the kernel: the GPU works on local data without switching, synchronizes with neighbors, and immediately moves to the next task
As a result, the GPU can move from local NVLink communication to global RDMA synchronization to its own computations within one kernel, without stopping in between.
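The difference between the layered and the persistent design can be sketched as a host-side toy model. This is plain Python, not CUDA and not mKernel's API: the task kinds, the `layered` and `persistent` functions, and the launch counting are all hypothetical, and the model only shows how many times control has to be handed over.

```python
from collections import deque

# Hypothetical task kinds; the article does not describe mKernel's real task types.
NVLINK, RDMA, COMPUTE = "nvlink", "rdma", "compute"


def layered(tasks):
    """One launch per task: control returns to the host between phases."""
    launches = 0
    log = []
    for kind, step in tasks:
        launches += 1          # each phase is a separate kernel or library call
        log.append((kind, step))
    return launches, log


def persistent(tasks):
    """One launch in total: a resident loop drains the queue until it is empty."""
    queue = deque(tasks)
    launches = 1               # the loop is started once
    log = []
    while queue:               # the same loop handles every task kind
        kind, step = queue.popleft()
        log.append((kind, step))
    return launches, log


# Four training steps, each with compute, intra-server and cross-server phases.
tasks = [(kind, step) for step in range(4) for kind in (COMPUTE, NVLINK, RDMA)]
print(layered(tasks)[0], persistent(tasks)[0])
```

Both versions process the same tasks in the same order. In this model the layered version hands control over 12 times and the persistent loop once.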
Concrete Example: How It Works in Practice
In distributed training, one GPU cannot move forward until the GPUs on other servers finish their computations and synchronize gradients. In the old approach, the GPU simply waits with an empty pipeline. With mKernel the process is different: the GPU continues local computations on data that is already loaded, synchronizes with neighbors via NVLink and RDMA at the same time, and moves on to the next training step without interruption. It is like an auto factory assembly line that does not stop while a part moves to the next station.
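The benefit of overlapping can be estimated with a simple timeline model. The sketch below assumes that independent local work is always available while a transfer is in flight, and the step durations (10 ms of compute, 6 ms of communication) are made-up numbers; it models a generic two-stage pipeline, not mKernel's actual scheduler.

```python
def serial_time(compute, comm):
    """Old approach: every step computes, then waits for its communication."""
    return sum(c + m for c, m in zip(compute, comm))


def overlapped_time(compute, comm):
    """Two-stage pipeline: step i's communication runs while step i+1 computes."""
    compute_done = 0
    comm_done = 0
    for c, m in zip(compute, comm):
        compute_done += c                             # compute never waits
        comm_done = max(compute_done, comm_done) + m  # needs its data and a free link
    return comm_done


# Hypothetical durations in milliseconds for four training steps.
compute = [10, 10, 10, 10]
comm = [6, 6, 6, 6]
print(serial_time(compute, comm), overlapped_time(compute, comm))
```

With these numbers the serial schedule takes 64 ms and the overlapped one 46 ms. Only the last transfer remains exposed, because there is no later computation to hide it behind.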
Why This Is Critical for Data Centers
Distributed training of large models is one of the most demanding computational tasks in modern AI development. When 1000 GPUs are used at once (and large companies work with even more), even a small synchronization slowdown can waste 20-30% of all resources. mKernel promises to eliminate this overhead. In initial tests on multi-node clusters, the researchers report 2-3x faster synchronization on typical operations. This matters especially for transformer workloads, including the attention mechanism, where communication between GPUs can be among the most expensive parts of the work.
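These two figures can be combined into a rough end-to-end estimate using Amdahl's law. Treating the 20-30% of wasted resources as the share of step time spent in synchronization is an assumption made here for illustration. The article does not report end-to-end numbers.

```python
def end_to_end_speedup(sync_fraction: float, sync_speedup: float) -> float:
    """Amdahl's law: only the synchronization share of a step gets faster."""
    return 1.0 / ((1.0 - sync_fraction) + sync_fraction / sync_speedup)


# Assumed synchronization shares (20%, 30%) and quoted sync speedups (2x, 3x).
for fraction in (0.2, 0.3):
    for speedup in (2.0, 3.0):
        overall = end_to_end_speedup(fraction, speedup)
        print(f"sync share {fraction:.0%}, sync {speedup:.0f}x faster: "
              f"step {overall:.2f}x faster")
```

Under these assumptions a 2-3x faster synchronization phase shortens a whole training step by roughly 1.1x to 1.25x. The end-to-end gain is always smaller than the speedup of the synchronization phase on its own.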
What This Means
mKernel is a signal that GPU programming is entering an era of integrated systems. Previously, engineers wrote code in layers: first computation, then synchronization, then data transfer. Now the boundary between them is blurring. That means faster computing in data centers and faster, more accessible training of large models. Most importantly, the next generation of distributed systems will be designed in a new way.