NVIDIA developed a method for training neural networks at 4-bit precision

NVIDIA has introduced NVFP4, a 4-bit floating-point format and training methodology for neural networks, as an alternative to the usual 8-bit training. It allows significant savings in memory and compute when training large models.
How It Works
The standard approach stores intermediate results and gradients at 16-bit (BF16) or 8-bit (FP8) precision. Moving to the 4-bit NVFP4 format halves the memory these values need relative to FP8.
The method does not simply lower precision. It combines several techniques: keeping numerically sensitive layers in the more precise BF16, applying random 16×16 Hadamard transforms to the inputs of gradient computations, and using stochastic rounding during computation.
Four-bit training has traditionally been considered risky: over a long run, rounding errors accumulate and can degrade the model. NVIDIA tested NVFP4 on a 12-billion-parameter hybrid Mamba-Transformer model trained on 10 trillion tokens, the longest public 4-bit training experiment to date. The result suggests that, with the right methodology, numerical errors do not accumulate catastrophically.
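To make the halving concrete, here is a minimal sketch of the storage arithmetic. The 12-billion-value tensor is a hypothetical size chosen for illustration, and the per-block scale overhead (one 8-bit scale shared by 16 values) is an assumption about how a block-scaled 4-bit format might store its data, not a figure from this article.

```python
# Bits needed to store one value in each format.
BITS_PER_VALUE = {"BF16": 16, "FP8": 8, "FP4": 4}


def tensor_gib(num_values: int, bits_per_value: float) -> float:
    """Return the size in GiB of a tensor holding num_values elements."""
    return num_values * bits_per_value / 8 / 2**30


def bits_with_block_scales(value_bits: int = 4, block_size: int = 16,
                           scale_bits: int = 8) -> float:
    """Effective bits per value when each block shares one scale factor.

    block_size and scale_bits are assumptions made for illustration.
    """
    return value_bits + scale_bits / block_size


NUM_VALUES = 12_000_000_000  # hypothetical tensor size

for name, bits in BITS_PER_VALUE.items():
    print(f"{name}: {tensor_gib(NUM_VALUES, bits):.1f} GiB")

scaled_bits = bits_with_block_scales()
print(f"FP4 with block scales: {tensor_gib(NUM_VALUES, scaled_bits):.1f} GiB")
```

Under these assumptions, the raw 4-bit payload is exactly half the FP8 size, while any shared scale factors push the effective saving somewhat below 2x.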
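Two of these techniques are easy to illustrate in isolation. The sketch below is a toy, pure-Python version and not NVIDIA's implementation: it assumes the 4-bit grid is the E2M1 set of magnitudes, assumes the "random" part of the Hadamard transform is a vector of random sign flips, and uses hypothetical function names throughout.

```python
import random

# Magnitudes representable by a 4-bit E2M1 float (assumed FP4 grid).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def stochastic_round(x: float, rng: random.Random) -> float:
    """Round x to one of its two neighbouring grid values.

    The upper neighbour is chosen with probability proportional to how
    close x is to it, so the rounding is unbiased on average for |x| <= 6.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP4_GRID[-1])  # saturate values beyond the grid
    for lo, hi in zip(FP4_GRID, FP4_GRID[1:]):
        if lo <= mag <= hi:
            p_up = (mag - lo) / (hi - lo)
            return sign * (hi if rng.random() < p_up else lo)
    return sign * mag  # not reached: the loop covers the whole grid


def hadamard(n: int) -> list[list[float]]:
    """Sylvester-construction Hadamard matrix, scaled to be orthogonal.

    n must be a power of two.
    """
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = n ** -0.5
    return [[v * scale for v in row] for row in h]


def random_hadamard_transform(x: list[float], signs: list[int]) -> list[float]:
    """Flip signs of x element-wise, then multiply by the Hadamard matrix."""
    n = len(x)
    h = hadamard(n)
    flipped = [s * v for s, v in zip(signs, x)]
    return [sum(h[i][j] * flipped[j] for j in range(n)) for i in range(n)]


rng = random.Random(0)

# A single outlier is spread evenly across all 16 outputs.
outlier = [0.0] * 15 + [8.0]
signs = [rng.choice([-1, 1]) for _ in range(16)]
spread = random_hadamard_transform(outlier, signs)
print(sorted({abs(v) for v in spread}))

# Stochastic rounding of 0.75 lands on 0.5 or 1.0, averaging near 0.75.
samples = [stochastic_round(0.75, rng) for _ in range(20000)]
print(sum(samples) / len(samples))
```

The Hadamard step shows why such a transform can help before quantization: one large value that would dominate a coarse grid becomes many moderate ones. The rounding step shows why randomness matters: always rounding to the nearest value would push 0.75 the same way every time, while stochastic rounding keeps the average close to the true value.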
Results Exceeded Expectations
The key metric was accuracy on MMLU-Pro, a broad knowledge benchmark covering mathematics, natural sciences, humanities, and other fields. The NVFP4 model scored 62.58%, just 0.04 percentage points below a model trained with the traditional FP8 method (62.62%). For practical purposes the difference is negligible and likely within measurement noise.
Given the twofold memory saving, this is a rare case where lowering numerical precision did not noticeably reduce quality. On this benchmark, at least, NVFP4 does not trade accuracy for resource savings.
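The gap between the two scores can be expressed either in percentage points or as a relative change, and the two are easy to confuse. A quick check using only the two scores quoted above (the variable names are illustrative):

```python
fp8_acc = 62.62    # MMLU-Pro accuracy of the FP8 baseline, in percent
nvfp4_acc = 62.58  # MMLU-Pro accuracy of the NVFP4 model, in percent

gap_pp = fp8_acc - nvfp4_acc             # absolute gap, percentage points
relative_pct = gap_pp / fp8_acc * 100    # gap relative to the baseline, percent

print(f"{gap_pp:.2f} percentage points, {relative_pct:.3f}% relative")
# prints "0.04 percentage points, 0.064% relative"
```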
- Memory reduction: 2x compared to FP8
- Accuracy loss on benchmark: less than 0.1 percentage points
- Experiment scale: 10 trillion tokens
- Architecture: hybrid Mamba-Transformer model with 12 billion parameters
What This Means for the Industry
The result matters most for companies that train models from scratch. Halving memory means the same work can be done faster or more cheaply, or the freed resources can go toward training larger models. In the best case, a run that takes 1,000 GPU-days in FP8 could drop toward 500 GPU-days on GPUs with native 4-bit support, though the real gain will be smaller because some layers stay in BF16 and halving memory does not automatically halve compute time.
For researchers, this opens room to experiment with architectures, data volumes, and hyperparameters. Shorter training runs make it more practical to test new ideas on larger models.
However, the method still needs validation on other model types, particularly pure transformers and other architectures. So far NVIDIA has shown results only on the hybrid Mamba-Transformer architecture. Four-bit training is also a specialized technique that depends on specific software optimizations and hardware support, which currently exists in full only on NVIDIA GPUs.
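The GPU-day figure above is an ideal case. A rough way to see how it changes when only part of the work benefits is an Amdahl-style estimate. The function below is a hypothetical sketch: the fraction of compute that runs in 4-bit and the 2x speedup on that fraction are both assumptions, not measured values.

```python
def projected_gpu_days(baseline_days: float, fp4_fraction: float,
                       fp4_speedup: float = 2.0) -> float:
    """Estimate training time when only part of the compute is accelerated.

    baseline_days: GPU-days for the run without 4-bit training.
    fp4_fraction:  share of the compute that runs in 4-bit (assumed).
    fp4_speedup:   speedup on that share (assumed, 2x by default).
    """
    if not 0.0 <= fp4_fraction <= 1.0:
        raise ValueError("fp4_fraction must be between 0 and 1")
    if fp4_speedup <= 0:
        raise ValueError("fp4_speedup must be positive")
    unchanged = 1.0 - fp4_fraction
    accelerated = fp4_fraction / fp4_speedup
    return baseline_days * (unchanged + accelerated)


# Ideal case: everything runs in 4-bit at twice the speed.
print(projected_gpu_days(1000, 1.0))

# More cautious case: 80% of the compute is accelerated.
print(round(projected_gpu_days(1000, 0.8)))
```

With everything accelerated the estimate is 500 GPU-days. With 80% accelerated it is about 600, which shows how quickly the layers kept at higher precision eat into the headline saving.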