Hardware

Interconnect (NVLink, InfiniBand)

An interconnect is the high-speed communication link that transfers data between GPUs within a server (intra-node, e.g., NVLink) or between servers in a cluster (inter-node, e.g., InfiniBand); its bandwidth and latency are primary bottlenecks in distributed AI training and inference.

In AI computing, interconnect refers to the physical links and protocols that move data between processors. Intra-node interconnects such as NVIDIA NVLink operate between GPUs on the same server through a switching fabric (NVSwitch); inter-node interconnects such as InfiniBand and RoCE (RDMA over Converged Ethernet) connect servers within a data center. Because intra-node links are typically 5–50× faster than inter-node links, distributed training algorithms are structured to keep the most communication-intensive operations within a single node.

NVLink, introduced in 2016, has evolved through several generations. NVLink 4.0 in H100 systems provides 900 GB/s of bidirectional bandwidth per GPU when all eight GPUs in a node communicate through NVSwitch simultaneously, compared with approximately 128 GB/s (bidirectional) for a PCIe 5.0 x16 link. InfiniBand, originally developed for high-performance computing (HPC), offers per-port bandwidths of 400 Gb/s (NDR, 2022) and 800 Gb/s (XDR, 2025) with sub-microsecond latency; note that these figures are in gigabits per second, so they correspond to 50 GB/s and 100 GB/s per port. This is the fabric that carries the all-reduce collective operations that synchronize gradients across thousands of nodes. Google's TPU pods use a proprietary inter-chip interconnect (ICI), supplemented by optical circuit switching, in Google's own training infrastructure.
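The figures above mix gigabytes per second (NVLink, PCIe) and gigabits per second (InfiniBand), which makes them easy to misread. The following is a minimal sketch that puts them on one scale. The constants are the numbers quoted in the text; the function names are illustrative, and the comparison is rough because the NVLink figure is bidirectional per GPU while the InfiniBand figure is per port.

```python
# Figures quoted in the text; only the unit conversion is added here.
NVLINK4_GBYTES_PER_S = 900.0   # GB/s, bidirectional, per H100 GPU
PCIE5_GBYTES_PER_S = 128.0     # GB/s, approximate figure from the text
IB_PORT_GBITS_PER_S = {"NDR": 400.0, "XDR": 800.0}  # Gb/s per port


def gbits_to_gbytes(gbits_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbits_per_s / 8.0


def nvlink_ratio(link_gbytes_per_s: float) -> float:
    """Return how many times faster NVLink 4.0 is than the given link."""
    return NVLINK4_GBYTES_PER_S / link_gbytes_per_s


if __name__ == "__main__":
    print(f"PCIe 5.0: NVLink 4.0 is {nvlink_ratio(PCIE5_GBYTES_PER_S):.1f}x faster")
    for name, gbits in IB_PORT_GBITS_PER_S.items():
        gbytes = gbits_to_gbytes(gbits)
        print(f"{name}: {gbytes:.0f} GB/s per port, "
              f"NVLink 4.0 is {nvlink_ratio(gbytes):.1f}x faster")
```

Under these assumptions a single NDR port comes out well over an order of magnitude slower than one GPU's NVLink bandwidth, which is the gap that parallelism strategies have to design around.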

Interconnect performance determines how efficiently a cluster scales with added nodes. Because inter-node bandwidth is usually far below intra-node bandwidth, parallelism strategies must minimize cross-node data movement: tensor-parallel groups (which exchange partial sums on every layer) are confined to one node's NVLink domain, while pipeline parallelism (which only passes activations between stages) spans nodes over InfiniBand. All-reduce, the collective that sums gradients across all participating GPUs, is limited by the bandwidth of the links it crosses, and that bandwidth in turn sets an upper bound on training throughput. Gradient all-reduce is normally overlapped with the backward pass, so if all-reduce time exceeds backward-pass time, the excess cannot be hidden and GPUs idle.
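The idle-time condition can be made concrete with a back-of-the-envelope cost model. The sketch below assumes a ring all-reduce, in which each rank sends and receives roughly 2(N-1)/N times the payload; the text does not say which algorithm is used, and the model ignores latency, topology, and protocol overhead. The payload size, bandwidth, and backward-pass time in the usage section are hypothetical values chosen only for illustration.

```python
def ring_allreduce_seconds(payload_bytes: float, num_ranks: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of ring all-reduce time.

    Each rank transfers about 2 * (N - 1) / N times the payload
    (a reduce-scatter phase followed by an all-gather phase).
    """
    if num_ranks < 2:
        return 0.0  # nothing to exchange with a single rank
    traffic_bytes = 2.0 * (num_ranks - 1) / num_ranks * payload_bytes
    return traffic_bytes / link_bytes_per_s


def exposed_comm_seconds(allreduce_s: float, backward_s: float) -> float:
    """Communication time that cannot be hidden behind the backward pass."""
    return max(0.0, allreduce_s - backward_s)


if __name__ == "__main__":
    # Hypothetical workload: 14 GB of gradients, 64 ranks, 50 GB/s per link.
    t_comm = ring_allreduce_seconds(14e9, 64, 50e9)
    # Hypothetical backward-pass duration of 0.4 s.
    idle = exposed_comm_seconds(t_comm, 0.4)
    print(f"all-reduce: {t_comm:.3f} s, exposed (GPU idle): {idle:.3f} s")
```

In this model the exposed time drops to zero as soon as the all-reduce fits inside the backward pass, which is why link bandwidth matters only up to the point where communication is fully hidden.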

As of 2026, NVIDIA's 800G InfiniBand and the NVLink Switch domain, which can connect up to 576 GPUs at full NVLink speeds across multiple chassis, represent the frontier. Competing approaches include the Ultra Ethernet Consortium's standards (backed by Broadcom, among others) and silicon photonics for inter-rack optical links that reduce power consumption. The network topology of a cluster (fat-tree, rail-optimized, or dragonfly) is a primary differentiator between well-engineered and poorly engineered AI infrastructure, and can be as important to total training throughput as raw GPU FLOPS.

Example

In a training setup with 64 H100 servers (8 GPUs each), tensor-parallel groups of 8 GPUs within each server exchange data at 900 GB/s over NVLink, while gradient all-reduce across all 512 GPUs traverses 400G InfiniBand links and is overlapped with the remaining backward-pass computation to hide latency.
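One way to picture this setup is as a mapping from global GPU ranks to communication groups. The sketch below assumes a common layout that the text does not spell out: ranks are numbered node by node, each server forms one tensor-parallel group, and each GPU exchanges gradients with the GPUs at the same local index on the other servers. All names are hypothetical.

```python
NUM_NODES = 64       # servers in the example
GPUS_PER_NODE = 8    # H100 GPUs per server


def tensor_parallel_group(rank: int) -> list[int]:
    """Ranks sharing a server with `rank`; they communicate over NVLink."""
    first = (rank // GPUS_PER_NODE) * GPUS_PER_NODE
    return list(range(first, first + GPUS_PER_NODE))


def data_parallel_group(rank: int) -> list[int]:
    """Ranks with the same local GPU index on every server.

    Gradient all-reduce within this group crosses InfiniBand.
    """
    local_index = rank % GPUS_PER_NODE
    return list(range(local_index, NUM_NODES * GPUS_PER_NODE, GPUS_PER_NODE))


if __name__ == "__main__":
    rank = 13  # second server, local GPU index 5
    print("NVLink peers:    ", tensor_parallel_group(rank))
    print("InfiniBand peers:", len(data_parallel_group(rank)), "ranks")
```

Under this layout every GPU has 7 NVLink peers and 63 InfiniBand peers, so the bandwidth-heavy tensor-parallel traffic never leaves the server.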
