
NVIDIA speeds up LLM pretraining: NVFP4 on Blackwell paired with JAX and MaxText


AI-processed from NVIDIA Developer Blog; edited by Hamidun News
Source: NVIDIA Developer Blog. Collage: Hamidun News.

Pretraining frontier LLMs is bounded by the throughput of the compute system. NVIDIA has shown how combining JAX, MaxText, and the new NVFP4 format on Blackwell chips can significantly accelerate pretraining without sacrificing model quality.

Why Every Percent Matters

When training runs across trillions of tokens on thousands of accelerators, saving even one percent of time per step adds up to real calendar time: roughly a day for every hundred days of training. At the scale of frontier pretraining, that converts directly into millions of dollars in compute costs. NVFP4, a four-bit floating-point format introduced with the Blackwell architecture, has become one of the key tools for accelerating matrix operations.

Compared to FP8, it stores each value in half the bits, which reduces memory traffic and raises the effective throughput of the tensor cores. The main challenge is that the four-bit numerical grid is sparse: each element has only 16 possible encodings. With improper configuration, gradients easily fall outside the representable range, which leads to training divergence.
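To make the sparsity of the grid concrete, here is a minimal sketch in plain Python. It assumes the E2M1 element layout (1 sign bit, 2 exponent bits, 1 mantissa bit) and one 8-bit scale shared by each block of 16 elements, which follows NVIDIA's public description of NVFP4 rather than anything stated in this article. The helper names are illustrative, and ties are rounded toward the smaller magnitude for simplicity, which may differ from hardware rounding.

```python
# Magnitudes representable by a 4-bit E2M1 float
# (assumed layout: 1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_to_e2m1(x: float) -> float:
    """Round x to the nearest E2M1 value, saturating at +/-6.

    Simplification: exact ties resolve to the smaller magnitude.
    """
    sign = -1.0 if x < 0 else 1.0
    magnitude = min(abs(x), E2M1_MAGNITUDES[-1])  # clamp to the largest value
    nearest = min(E2M1_MAGNITUDES, key=lambda v: abs(v - magnitude))
    return sign * nearest


def bits_per_element(element_bits: int = 4,
                     scale_bits: int = 8,
                     block_size: int = 16) -> float:
    """Average storage per value when one scale is shared by a block.

    The defaults (8-bit scale per 16 elements) are an assumption about
    NVFP4; the per-tensor scale is ignored because it is amortized
    over the whole tensor.
    """
    return element_bits + scale_bits / block_size
```

With these assumptions, `round_to_e2m1(2.4)` lands on 2.0, any input above 6 saturates at 6.0, and `bits_per_element()` comes out at 4.5 bits, so the density advantage over 8-bit storage is slightly under 2x once the block scales are counted.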

NVIDIA and the MaxText team addressed this through custom scaling schemes and dynamic loss scaling.
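The article does not describe the exact scheme, so the following is only a sketch of generic dynamic loss scaling as it is commonly used in mixed-precision training: shrink the scale when gradients overflow, grow it again after a run of clean steps. The class name, default values, and the `update` method are hypothetical and are not taken from MaxText or from NVIDIA's guide.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DynamicLossScaler:
    """Generic dynamic loss scaling (illustrative, not the MaxText API)."""
    scale: float = 2.0 ** 15      # multiply the loss by this before backprop
    growth_factor: float = 2.0    # applied after a run of clean steps
    backoff_factor: float = 0.5   # applied when an overflow is detected
    growth_interval: int = 2000   # clean steps required before growing
    good_steps: int = 0

    def update(self, scaled_grads: Iterable[float]) -> bool:
        """Inspect scaled gradients; return True if the step may be applied."""
        if not all(math.isfinite(g) for g in scaled_grads):
            # Overflow: skip this optimizer step and shrink the scale.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # Stable for a while: try a larger scale to preserve small gradients.
            self.scale *= self.growth_factor
            self.good_steps = 0
        return True
```

In a training loop, the loss would be multiplied by `scaler.scale` before differentiation, and the gradients divided by the same value before the optimizer update whenever `update` returns True.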

How Mixed Precision Works with NVFP4

Mixed-precision training is not new: FP8 and BF16 are already industry standards. NVFP4 goes one step further, allowing 4-bit operands in the most computationally heavy matrix multiplications while keeping higher precision where it truly matters.

  • NVFP4 is applied to weights and activations in GEMM operations
  • BF16 or FP32 remain for accumulators and normalization
  • MaxText automatically routes operations to the appropriate format
  • JAX compiles the computational graph through XLA, optimizing kernels for Blackwell
  • Result: higher throughput with comparable or lower power consumption
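The division of labor in the list above can be sketched with a toy dot product: operands are quantized block by block to 4-bit values, while products are summed in a wider type. This is a simplified illustration in plain Python, not the kernel that cuBLAS or XLA runs. The block size of 16 and the E2M1 value grid are assumptions about NVFP4, the per-block scale is kept in full precision instead of FP8, and a Python float stands in for the BF16/FP32 accumulator.

```python
E2M1 = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # assumed 4-bit value grid
BLOCK = 16                                       # assumed scaling block size


def quantize_block(values):
    """Return (scale, codes) such that each value is roughly scale * code."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        magnitude = abs(v) / scale
        q = min(E2M1, key=lambda c: abs(c - magnitude))
        codes.append(q if v >= 0 else -q)
    return scale, codes


def fp4_dot(a, b):
    """Dot product with 4-bit operands and a high-precision accumulator."""
    if len(a) != len(b) or len(a) % BLOCK != 0:
        raise ValueError("inputs must have equal length, a multiple of BLOCK")
    acc = 0.0  # stands in for the higher-precision accumulator
    for start in range(0, len(a), BLOCK):
        scale_a, codes_a = quantize_block(a[start:start + BLOCK])
        scale_b, codes_b = quantize_block(b[start:start + BLOCK])
        partial = sum(x * y for x, y in zip(codes_a, codes_b))
        acc += scale_a * scale_b * partial  # rescale once per block
    return acc
```

Because every block carries its own scale, one outlier only coarsens the 16 values around it, and keeping the accumulator wide stops rounding errors from compounding across the sum.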

The Stack and What to Change in Code

MaxText is an open-source, high-performance training framework built on JAX and developed by Google. It was originally created for TPUs but is being actively adapted for GPU clusters, which makes the partnership with NVIDIA a natural fit. NVIDIA ships low-level NVFP4 kernels as part of cuBLAS and cuDNN, and JAX/XLA gained support for these operations through dedicated adapters. Developers do not need to rewrite training code by hand: it is enough to enable the relevant flags in the MaxText configuration and make sure the cluster runs Blackwell chips (B100, B200, GB200).

"Numerical precision is one of the most leveraged parameters, but low-bit mixed-precision pretraining is difficult to implement correctly," notes the NVIDIA Developer Blog team.

What This Means

For teams pretraining frontier models, NVFP4 on Blackwell comes close to a free speedup: an existing JAX and MaxText stack needs only minimal configuration changes. At the scale of hundreds or thousands of GPUs, even 10–15% throughput gains directly reduce time-to-checkpoint and the overall compute budget. The race for pretraining efficiency is now increasingly being fought over numerical precision.
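A throughput gain and a time saving are not the same number, so a short calculation helps. If tokens per second rise by a fraction g, the wall-clock time for a fixed token budget falls by 1 - 1/(1 + g). The sketch below applies that formula; the 90-day baseline mentioned afterwards is a hypothetical figure chosen for illustration, not a number from NVIDIA.

```python
def time_saved_fraction(throughput_gain: float) -> float:
    """Fraction of wall-clock time saved for a fixed token budget.

    throughput_gain is relative, e.g. 0.10 for 10% more tokens per second.
    """
    return 1.0 - 1.0 / (1.0 + throughput_gain)


def days_saved(baseline_days: float, throughput_gain: float) -> float:
    """Calendar days saved on a run that would otherwise take baseline_days."""
    return baseline_days * time_saved_fraction(throughput_gain)
```

On a hypothetical 90-day run, a 10% throughput gain saves about 8.2 days and a 15% gain about 11.7 days, a little less than the headline percentage applied directly to the baseline.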

