Hardware

VRAM

VRAM (Video Random Access Memory) is the dedicated high-speed memory on a GPU used to store model weights, activations, and KV cache during AI training and inference. Its capacity is the primary constraint on which models can run on a given GPU.

VRAM is the on-board memory directly attached to a GPU, distinct from system RAM. In AI workloads it stores model parameters (weights), input data, activations, gradient tensors during training, and key-value caches during inference. Because GPU compute cores read from VRAM continuously while a model runs, its bandwidth governs throughput and its capacity caps the maximum model size that can be loaded.
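The key-value cache mentioned above grows with context length and batch size. A minimal sketch of the usual back-of-the-envelope estimate follows, assuming a standard transformer that stores one key and one value vector per token, per layer, per KV head; the function name and the model shape are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value=2):
    """Rough KV cache size in bytes.

    The leading factor of 2 counts keys and values separately.
    bytes_per_value=2 assumes 16-bit storage (an assumption).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, not taken from any specific model card.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, batch_size=1)
print(f"KV cache: {size / 1e9:.2f} GB")
```

Doubling either the sequence length or the batch size doubles the estimate, which is why long-context serving can be limited by cache size rather than by weights.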

The relationship between model size and VRAM is roughly linear: a model with 7 billion parameters stored in 16-bit floating point (2 bytes per parameter) requires about 14 GB for weights alone, before accounting for activations, optimizer states, or KV cache. Quantization to 8-bit or 4-bit formats (GPTQ, AWQ, bitsandbytes) reduces this footprint by roughly 2–4×. During training, peak VRAM usage is typically several times the weight footprint (commonly cited as 3–6×): alongside the weights, training holds a gradient for every parameter and the Adam optimizer maintains two additional tensors per parameter, with activations on top of that.
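The arithmetic above can be sketched in a few lines. This is a rough estimate only: the function names are hypothetical, and the training figure assumes weights, gradients, and both Adam tensors are all held at the same precision, with activations ignored.

```python
# Bytes per parameter for common storage formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(num_params, dtype):
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def adam_state_gb(num_params, dtype="fp32"):
    """Weights + gradients + two Adam tensors, all in one dtype.

    Assumption: uniform precision and no activations, so this is a
    lower bound on training memory rather than a peak figure.
    """
    return 4 * weight_gb(num_params, dtype)


for dtype in BYTES_PER_PARAM:
    print(f"7B weights in {dtype}: {weight_gb(7e9, dtype):.1f} GB")
print(f"7B Adam training state in fp32: {adam_state_gb(7e9):.0f} GB")
```

Mixed-precision setups that keep a separate 32-bit master copy of the weights would change the multiplier, so the factor of 4 here should be read as one illustrative case.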

VRAM capacity is a hard ceiling that shapes the hardware market. NVIDIA's A100 shipped with 40 GB of HBM2 or 80 GB of HBM2E; the H100 carries 80 GB HBM3; the H200 expands to 141 GB HBM3E. Consumer GPUs such as the RTX 4090 carry 24 GB GDDR6X, which is sufficient for inference on models up to roughly 13–20B parameters at reduced precision but far short of data-center cards. This gap drives multi-GPU serving and frameworks such as llama.cpp that can offload part of a model to CPU memory.
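A quick way to reason about this ceiling is to compare an estimated footprint against each card's capacity. The sketch below uses the capacities listed above; the `fits` helper and its 20% overhead allowance for activations and KV cache are assumptions for illustration, not measured values.

```python
# VRAM capacities in GB, as listed in the text above.
GPU_VRAM_GB = {
    "RTX 4090": 24,
    "A100 40GB": 40,
    "A100 80GB": 80,
    "H100": 80,
    "H200": 141,
}


def fits(num_params, bits_per_param, vram_gb, overhead_fraction=0.2):
    """True if weights plus a flat overhead allowance fit in vram_gb.

    overhead_fraction is a hypothetical stand-in for activations and
    KV cache; real overhead depends on batch size and context length.
    """
    weights_gb = num_params * bits_per_param / 8 / 1e9
    return weights_gb * (1 + overhead_fraction) <= vram_gb


for name, vram in GPU_VRAM_GB.items():
    ok_4bit = fits(13e9, 4, vram)
    ok_16bit = fits(13e9, 16, vram)
    print(f"{name}: 13B at 4-bit fits={ok_4bit}, at 16-bit fits={ok_16bit}")
```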

As of 2026, VRAM remains the central engineering constraint for deploying frontier models. GPT-4-class systems are generally understood to require hundreds of gigabytes distributed across multiple GPUs via tensor and pipeline parallelism. AMD's MI300X (192 GB HBM3 per card) and NVIDIA's B200 (192 GB HBM3E) are designed to push the ceiling higher, while algorithms such as FlashAttention reduce peak VRAM consumption by computing attention in tiles so the full attention matrix is never stored, recomputing intermediate values in the backward pass instead of keeping them.

Example

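A developer running a 70B-parameter model quantized to INT4 (~35 GB of weights) can fit it on a single H100 GPU (80 GB VRAM), with headroom left for KV cache and activations. The same model in 16-bit precision (~140 GB) does not fit on one card and needs at least two H100s, using tensor parallelism to split weight matrices across the cards for inference.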
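The numbers in this example can be checked with a short calculation. This is a simplified sketch: it assumes weights divide evenly across GPUs under tensor parallelism, and the `reserve_gb` headroom for KV cache and activations is a hypothetical value rather than a recommendation.

```python
import math


def total_weight_gb(num_params, bits_per_param):
    """Total weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def per_gpu_weight_gb(num_params, bits_per_param, num_gpus):
    """Per-GPU share, assuming an even tensor-parallel split."""
    return total_weight_gb(num_params, bits_per_param) / num_gpus


def min_gpus(num_params, bits_per_param, vram_gb, reserve_gb=10.0):
    """Smallest GPU count whose usable VRAM holds the weights.

    reserve_gb is an assumed per-GPU allowance for everything that
    is not weights; it is illustrative only.
    """
    usable = vram_gb - reserve_gb
    return math.ceil(total_weight_gb(num_params, bits_per_param) / usable)


print(total_weight_gb(70e9, 4))        # INT4 weights in GB
print(per_gpu_weight_gb(70e9, 4, 2))   # share per GPU across two cards
print(min_gpus(70e9, 4, 80))           # H100s needed at INT4
print(min_gpus(70e9, 16, 80))          # H100s needed at 16-bit
```

Real deployments often pick a GPU count that also divides the model's attention heads evenly, which this estimate does not consider.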
