NVIDIA explained how to train transformers at reduced precision without losing quality
AI-processed from NVIDIA Developer Blog; edited by Hamidun News
NVIDIA has published an extensive guide on its Developer Blog to optimizing transformer training with reduced-precision arithmetic, specifically FP8 and BF16. It is aimed at engineers who want to cut the cost of training runs without compromising model quality.
Why Teams Need This
Transformers form the foundation of most modern language and generative models. As models grow from billions to tens of billions of parameters, the cost of a single training run rises steeply. Each experiment consumes more GPU-hours, which slows development and raises costs. In practice, slow training is not just a technical inconvenience: it limits how many hypotheses a team can test in a quarter and how large a model it can afford. NVIDIA frames transformer acceleration not as an optional optimization but as a requirement for staying competitive.
What Is Low-Precision Training
Neural networks have traditionally been trained in 32-bit floating point (FP32), which offers high numerical precision but uses more memory and runs slower on modern GPUs than narrower formats. Reducing the bit width lets more data fit in GPU memory and speeds up matrix operations:
- FP16: 16-bit floating point; supported by most modern GPUs
- BF16: bfloat16 (Brain Floating Point); the same 16 bits as FP16 but with a wider dynamic range, which makes it better suited to large models whose training is prone to instability
- FP8: 8-bit floating point, available on the Hopper architecture (H100, H200); offers up to twice the peak matrix-operation throughput of BF16
- INT8: 8-bit integer; used more often for inference than for training
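To make the trade-offs in the list above more concrete, here is a minimal Python sketch that derives each format's numeric range and weight-storage footprint from its bit layout. The exponent and mantissa widths are standard figures that the article itself does not state, the helper names are made up for illustration, and the FP8 entry uses the E5M2 variant only because it follows the same IEEE-style rules as the other formats (the E4M3 variant deviates slightly and is left out).

```python
# Bit layouts as (exponent bits, mantissa bits). These are assumed standard
# values, not figures quoted from the NVIDIA article.
FORMATS = {
    "FP32": (8, 23),
    "FP16": (5, 10),
    "BF16": (8, 7),
    "FP8 (E5M2)": (5, 2),
}


def max_normal(exp_bits, man_bits):
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -man_bits) * 2.0 ** bias


def min_subnormal(exp_bits, man_bits):
    """Smallest positive (subnormal) value of the format."""
    bias = 2 ** (exp_bits - 1) - 1
    return 2.0 ** (1 - bias - man_bits)


def weight_bytes(num_params, exp_bits, man_bits):
    """Bytes needed to store num_params weights (1 sign bit + exp + mantissa)."""
    total_bits = 1 + exp_bits + man_bits
    return num_params * total_bits // 8


# Hypothetical model size, chosen to match the 70-billion-parameter example
# mentioned later in the article.
NUM_PARAMS = 70_000_000_000

for name, (e, m) in FORMATS.items():
    gigabytes = weight_bytes(NUM_PARAMS, e, m) / 1e9
    print(f"{name:11s} max={max_normal(e, m):.3e} "
          f"min={min_subnormal(e, m):.3e} weights={gigabytes:.0f} GB")
```

The output illustrates the point about BF16: it shares FP32's exponent width, so its largest representable value is of the same order as FP32's, while FP16 tops out at 65504. The memory column counts weights only; optimizer state, gradients, and activations are not included.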
The main challenge is maintaining numerical stability when precision is cut this sharply. A naive switch from FP32 to FP8 can cause training to become unstable or diverge.
Techniques NVIDIA Recommends
Simply swapping the data format does not work, so NVIDIA describes several proven approaches.
Mixed precision. A master copy of the weights is kept in FP32, while the forward and backward passes run in FP16 or BF16. This combines the speed of low-precision computation with the reliability of full-precision parameter storage, and it is the de facto standard for most modern training pipelines.
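One way to see why keeping the weights in FP32 matters is a toy update loop. The sketch below is an illustration only, not code from the NVIDIA guide: it uses Python's `struct` module to round values to FP16 and FP32, and the weight, update size, and step count are arbitrary numbers picked for the demonstration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def apply_updates(steps, update, round_weight):
    """Add the same small update to a weight, rounding after every step."""
    w = 1.0
    for _ in range(steps):
        w = round_weight(w + update)
    return w


# Near 1.0, neighbouring FP16 values are about 0.001 apart, so an update of
# 0.0001 is rounded away on every single step.
w_fp16 = apply_updates(1000, 1e-4, to_fp16)

# The FP32 copy accumulates the same updates and ends up close to 1.1.
w_fp32 = apply_updates(1000, 1e-4, to_fp32)

print("weight stored in FP16:", w_fp16)
print("weight stored in FP32:", w_fp32)
```

The FP16 weight never moves from 1.0, while the FP32 weight absorbs all thousand updates. This is the failure mode that a full-precision weight copy avoids, even when the matrix multiplications themselves run in a 16-bit format.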
Loss scaling. FP16 cannot represent very small numbers, so small gradient values can underflow to zero. Loss scaling multiplies the loss by a constant factor before the backward pass and then divides the gradients by the same factor before the weight update. Modern implementations do this automatically and adapt the factor during training.
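The mechanism can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation of any particular framework: `DynamicLossScaler` is a hypothetical class, the initial scale and growth interval are arbitrary choices, and FP16 rounding is simulated with the `struct` module.

```python
import math
import struct


def to_fp16(x):
    """Round to the nearest FP16 value; values too large for FP16 become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


class DynamicLossScaler:
    """Hypothetical adaptive scaler: back off on overflow, grow when stable."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=3):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def unscale(self, scaled_grads):
        """Return unscaled gradients, or None if the step should be skipped."""
        if any(not math.isfinite(g) for g in scaled_grads):
            self.scale /= 2.0          # overflow: halve the scale, skip the step
            self._clean_steps = 0
            return None
        grads = [g / self.scale for g in scaled_grads]
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            self.scale *= 2.0          # stable for a while: try a larger scale
            self._clean_steps = 0
        return grads


# Without scaling, a gradient of 1e-8 is flushed to zero in FP16.
print("unscaled 1e-8 in FP16:", to_fp16(1e-8))

# With scaling, the tiny gradient survives; the large one forces a back-off.
true_grads = [1e-8, 2.0]
scaler = DynamicLossScaler()
for step in range(4):
    scale_used = scaler.scale
    scaled = [to_fp16(g * scale_used) for g in true_grads]
    print(step, scale_used, scaler.unscale(scaled))
```

In the first iterations the large gradient overflows FP16 once it is multiplied by the scale, so the scaler halves the factor and skips the step. Once the factor is small enough, both gradients come back close to their true values, including the one that plain FP16 would have rounded to zero.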
TransformerEngine. This specialized NVIDIA library manages precision automatically at the level of individual transformer layers. It supports FP8 on Hopper and integrates with PyTorch, JAX, and Megatron-LM. Instead of rewriting the whole training codebase, an engineer plugs in TransformerEngine and gets FP8 acceleration with minimal changes.
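Part of what makes FP8 workable is that each tensor is rescaled into the narrow FP8 range before it is cast. The sketch below illustrates that general idea in plain Python; it is a toy model and not TransformerEngine's actual code or API. It assumes the E4M3 variant of FP8 (4 exponent bits, 3 mantissa bits, largest value 448), which the article does not name, and all function names are hypothetical.

```python
import math

# Assumed properties of the E4M3 FP8 variant (not stated in the article).
E4M3_MAX = 448.0          # largest finite value
E4M3_MIN_EXP = -6         # exponent of the smallest normal value
E4M3_MANTISSA_BITS = 3


def round_to_e4m3(x):
    """Round to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)                      # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)            # clamp into the subnormal range
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)  # spacing between neighbours
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_with_scale(values):
    """Scale a tensor so its largest magnitude maps to E4M3_MAX, then round."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3(v * scale) for v in values], scale


def dequantize(quantized, scale):
    """Undo the scaling to recover approximate original values."""
    return [q / scale for q in quantized]


tensor = [1e-4, -5e-4, 2e-3]                  # small values, e.g. gradients

naive = [round_to_e4m3(v) for v in tensor]    # direct cast, no scaling
quantized, scale = quantize_with_scale(tensor)
restored = dequantize(quantized, scale)

print("naive cast :", naive)
print("with scale :", restored)
```

Cast directly, the two smallest values collapse to zero. With a per-tensor scale, every value is recovered to within the few percent of error that three mantissa bits allow. A production library also has to track how tensor magnitudes change between steps and choose a format per tensor, which is the bookkeeping an engineer avoids by relying on a library for it.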
"As models grow, training runs consume increasingly more GPU-hours and engineering time. This directly affects how quickly teams can experiment and how large a model they can afford," the NVIDIA Developer Blog notes.
What This Means
The guide is published at a moment when training efficiency has become as important as model accuracy. Teams on H100 or H200 receive concrete guidance: FP8 via TransformerEngine is one of the most accessible ways to reduce GPU budget without rearchitecting. For small labs, this can mean the difference between being able to train a 70-billion-parameter model or having to abandon it due to cost.