NVIDIA packed 3 models into one file and made training 360× more efficient

NVIDIA introduced Star Elastic, a method that packs three models of different sizes (30B, 23B, and 12B parameters) into a single weights file, training all of them from scratch in one run instead of three separate ones.
360× Training Cost Savings
Star Elastic builds on the Nemotron Elastic framework and is applied to Nemotron Nano v3, NVIDIA's new generation of models. The key feature: all three model variants train in a single 160B-token cycle. For comparison, training each model separately would require roughly 360× more computation, a massive saving given the cost of compute at supercomputer scale. The conventional approach either trains each size separately (expensive) or prunes weights from a larger model (which costs accuracy). Star Elastic takes a third path: it embeds nested models in a single checkpoint while fully preserving the quality of each size. All three models are stored in one file, and any of them can be invoked at inference time.
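To make the nesting concrete, here is a minimal Python sketch of how one checkpoint can hold several sub-models. It assumes, as elastic and matryoshka-style schemes typically do, that each smaller variant occupies the leading layers and leading hidden channels of the largest model; the layer counts, dimensions, and weight names are toy values, not Nemotron's actual architecture.

```python
import torch

# Toy stand-ins for the three nested sizes; real models have far more
# layers and far wider hidden dimensions (these names are hypothetical).
VARIANTS = {
    "30B": {"layers": 6, "hidden": 64},  # full model
    "23B": {"layers": 5, "hidden": 48},  # nested mid-size variant
    "12B": {"layers": 4, "hidden": 32},  # nested small variant
}

def extract_variant(checkpoint: dict, size: str) -> dict:
    """Slice one nested sub-model out of the single shared weights file."""
    cfg = VARIANTS[size]
    sub = {}
    for name, w in checkpoint.items():
        layer_idx = int(name.split(".")[1])  # names look like "layers.3.proj"
        if layer_idx >= cfg["layers"]:
            continue  # this layer exists only in the larger variants
        # keep the leading slice along every dimension, capped at `hidden`
        idx = tuple(slice(0, min(d, cfg["hidden"])) for d in w.shape)
        sub[name] = w[idx]
    return sub

# One shared checkpoint; the smaller variants add no extra storage.
checkpoint = {f"layers.{i}.proj": torch.randn(64, 64) for i in range(6)}
small = extract_variant(checkpoint, "12B")  # 4 layers of 32x32 weights
```

The point of the sketch: serving the 12B variant requires no second file and no pruning step, just reading a sub-block of weights that training has already optimized to work on its own.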
Inference Becomes Faster and More Accurate
But training is only half the battle. Star Elastic also introduces elastic budget control, an inference approach that puts all three models to work at once. The idea is simple: during the reasoning phase, when the model deliberates, the small 12B variant runs to save computation, and for the final output the full 30B model produces the most accurate answer. The reported results:
- 16% higher accuracy compared to standard budget control
- 1.9× lower latency — the model responds faster
- Flexibility: organizations can choose reasoning depth depending on the task and budget
Standard budget control is broadly similar, but it offers no way to switch flexibly between model sizes during inference. Here the switching is built into the algorithm itself and happens automatically.
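For intuition, here is a minimal sketch of that two-phase routing. The `generate_12b` and `generate_30b` callables and the token budgets are assumptions for illustration; the article does not describe NVIDIA's actual inference API.

```python
from typing import Callable

# Stand-in type for whatever decoding call the serving stack exposes:
# (prompt, max_tokens) -> generated text. This is an assumption, not a
# real NVIDIA interface.
Generate = Callable[[str, int], str]

def elastic_budget_answer(prompt: str,
                          generate_12b: Generate,
                          generate_30b: Generate,
                          reasoning_budget: int = 1024) -> str:
    """Two-phase decoding: the cheap variant reasons, the full one answers."""
    # Phase 1: the 12B variant produces the deliberation trace under a
    # token budget, which is where most of the latency savings come from.
    trace = generate_12b(prompt, reasoning_budget)
    # Phase 2: the 30B variant writes the final answer, conditioned on the
    # prompt plus the cheap reasoning trace.
    return generate_30b(prompt + "\n" + trace, 512)
```

Because both variants live in the same checkpoint, switching between them is a routing decision rather than loading a second model.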
The Entire Family Now Fits on RTX
Star Elastic allows the models to be quantized to FP8 and to NVIDIA's own NVFP4 format (a 4-bit floating-point format that is more compact than standard ones). This means the entire trio of models can live on a single RTX GPU, even a consumer graphics card. Previously, a 30B model required data-center hardware such as the H100, which is out of reach for many companies. Now engineers can experiment with powerful models on their own PCs.
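A quick back-of-the-envelope check (a sketch, assuming weight memory dominates and ignoring KV cache and activations) shows why this fits: at 4 bits per parameter, the 30B checkpoint, which already contains both smaller variants, needs roughly 15 GB for weights.

```python
# Rough weight footprint; ignores KV cache, activations, and overhead.
# Because the smaller variants are nested inside the 30B checkpoint,
# the whole trio costs only the 30B file.
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for bits, fmt in [(16, "FP16"), (8, "FP8"), (4, "NVFP4")]:
    print(f"30B weights in {fmt}: ~{weight_gb(30, bits):.0f} GB")
# FP16 -> ~60 GB (needs data-center GPUs); FP8 -> ~30 GB;
# NVFP4 -> ~15 GB, within the 24 GB of a high-end consumer RTX card.
```

That 4× reduction versus FP16 is what moves the whole family from H100-class hardware down to a single consumer card.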
"This democratizes access to reasoning models," — in this spirit,
NVIDIA's developers argue.
What This Means
Organizations no longer need to choose between speed (small model) and quality (large model) at training time. They train once and pick the trade-off at inference, flexibly and without retraining. This cuts costs not only for training but also for inference serving: in practice, you pay less for GPU hours and gain more flexibility in production.