NVIDIA Nemotron: Diffusion Models Generate Text 6x Faster

Q: What is the source?

Originally published on Hugging Face Blog. Hamidun News processes and adapts the material with AI.

Q: When was it published?

May 25, 2026. Reading time: 3 min.

NVIDIA Nemotron generates 32 tokens at once instead of one, using diffusion instead of autoregression. This fundamental shift in approach enables parallel…

Hamidun News Editorial

NVIDIA introduced Nemotron-Labs Diffusion, a family of language models that generate multiple tokens simultaneously rather than one at a time. The approach targets both text generation speed and GPU efficiency.

Why Conventional Models Are Slow

Most modern language models are autoregressive: they generate one token, then the next, then another. Each step is a full forward pass through the network, so even a powerful GPU sits partly idle at every step. Generating 100 tokens takes 100 sequential passes, and each pass has to read the model weights from memory again. On modern GPUs such as the B200, a single decoding step spends more time on memory access than on arithmetic. That memory traffic, not compute, is the bottleneck.
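The memory bottleneck can be made concrete with a back-of-the-envelope estimate. The sketch below assumes that every forward pass streams all weights from memory exactly once, and that nothing else costs time. The 8B parameter count matches the model discussed later, but the 16-bit weight size and the 8 TB/s bandwidth figure are illustrative assumptions, not measurements from the source.

```python
def min_decode_seconds(n_params: float, bytes_per_param: float,
                       bandwidth_bytes_per_s: float, n_tokens: int) -> float:
    """Lower bound on single-stream autoregressive decode time.

    One token needs one forward pass, and each pass must read every
    weight from memory at least once.
    """
    seconds_per_pass = n_params * bytes_per_param / bandwidth_bytes_per_s
    return n_tokens * seconds_per_pass


# Illustrative assumptions, not figures from the article:
PARAMS = 8e9            # an 8B-parameter model
BYTES_PER_PARAM = 2     # 16-bit weights (assumed)
BANDWIDTH = 8e12        # about 8 TB/s of memory bandwidth (assumed)

t = min_decode_seconds(PARAMS, BYTES_PER_PARAM, BANDWIDTH, n_tokens=100)
print(f"{t * 1e3:.0f} ms for 100 tokens, at most {100 / t:.0f} tokens/s")
```

Under these assumptions a one-token-per-pass decoder cannot exceed roughly 500 tokens per second for a single request, however fast the arithmetic units are. Producing several tokens per pass is the only way past that ceiling.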

Diffusion Instead of Autoregression

Nemotron-Labs Diffusion addresses this problem with diffusion-style generation. The idea is simple: produce many tokens at once, then refine them. A single checkpoint supports three operating modes:

Autoregressive: standard mode, token by token, for compatibility
FastDiffuser: generates blocks of 32 tokens at a time and refines each block iteratively over multiple passes
LinearSpec: diffusion-based draft generation plus autoregressive verification, delivering a 6× speedup on B200

Developers select the mode at launch; application code remains unchanged.
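The draft-then-verify idea behind the third mode can be sketched with toy functions. This is standard greedy speculative decoding, not NVIDIA's implementation: whether LinearSpec uses exactly this acceptance rule is an assumption, and `draft_fn`, `target_next`, and the counting "models" are hypothetical stand-ins. A real system scores the whole drafted block in one forward pass. The loop below calls the verifier per position only to keep the logic readable.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]
DraftFn = Callable[[List[int], int], List[int]]


def verify_block(prefix: List[int], draft: List[int],
                 target_next: NextToken) -> List[int]:
    """Accept drafted tokens until the first disagreement with the verifier."""
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong token and stop
            return accepted
        accepted.append(tok)
    # Whole block accepted: the verifier also yields one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft_fn: DraftFn, target_next: NextToken,
             block_size: int, n_tokens: int) -> Tuple[List[int], int]:
    """Return the first n_tokens generated and the number of verify rounds."""
    out, rounds = list(prefix), 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(verify_block(out, draft_fn(out, block_size), target_next))
        rounds += 1
    return out[len(prefix):len(prefix) + n_tokens], rounds


def target_next(seq: List[int]) -> int:
    """Toy verifier: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_fn(seq: List[int], k: int) -> List[int]:
    """Toy drafter: same counting rule, but wrong whenever the answer is 7."""
    block, last = [], seq[-1]
    for _ in range(k):
        last = (last + 1) % 10
        block.append(0 if last == 7 else last)
    return block


tokens, rounds = generate([0], draft_fn, target_next, block_size=4, n_tokens=12)
print(tokens, rounds)
```

The result is token-for-token identical to what the verifier would produce alone, but here it takes 3 verification rounds instead of 12 single-token steps. The speedup depends entirely on how often the draft agrees with the verifier.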

Performance Metrics

Nemotron 8B compared with similar models:

On a B200 GPU in speculative mode, it reaches ~865 tokens per second
2.6× more tokens per forward pass than one-token-at-a-time decoding
+1.2% accuracy compared to Qwen3 8B
In the fastest mode, it generates 6.4× more tokens than conventional autoregressive models
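To see what the tokens-per-pass figure means for the 100-token example from earlier, the arithmetic can be written out. This sketch assumes the 2.6× figure is an average of 2.6 tokens committed per forward pass. That is one reading of the number above, not a detail confirmed by the source.

```python
import math


def passes_needed(n_tokens: int, tokens_per_pass: float) -> int:
    """Forward passes required when each pass yields tokens_per_pass tokens on average."""
    return math.ceil(n_tokens / tokens_per_pass)


baseline = passes_needed(100, 1.0)    # one token per pass
diffusion = passes_needed(100, 2.6)   # assumed average of 2.6 tokens per pass
print(baseline, diffusion)            # prints: 100 39
```

Fewer passes translate into wall-clock speedup only if each pass costs about the same as an autoregressive step, which is plausible when decoding is limited by memory traffic rather than arithmetic.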

The number of refinement passes is adjustable: fewer passes mean less computation and faster output, so engineers control the quality-speed tradeoff directly.
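How the number of passes trades against tokens per pass can be illustrated with a toy block decoder. The sketch uses confidence-based iterative unmasking, a common scheme for masked diffusion decoding. Whether FastDiffuser works this way is an assumption, and `toy_denoiser` is a hypothetical random stand-in for a real model.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decided yet


def toy_denoiser(block: List[Optional[int]],
                 rng: random.Random) -> Dict[int, Tuple[int, float]]:
    """Stand-in model: a (token, confidence) guess for every masked position."""
    return {i: (rng.randrange(100), rng.random())
            for i, tok in enumerate(block) if tok is MASK}


def refine_block(block_size: int, num_steps: int,
                 seed: int = 0) -> Tuple[List[Optional[int]], int]:
    """Fill a fully masked block, committing the most confident guesses each pass."""
    rng = random.Random(seed)
    block: List[Optional[int]] = [MASK] * block_size
    per_pass = math.ceil(block_size / num_steps)  # tokens committed per pass
    passes = 0
    while any(tok is MASK for tok in block):
        guesses = toy_denoiser(block, rng)  # one "forward pass" over the block
        passes += 1
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:per_pass]:
            block[i] = guesses[i][0]  # commit; the rest stay masked for later passes
    return block, passes


for steps in (32, 8, 4):
    _, used = refine_block(block_size=32, num_steps=steps)
    print(f"{steps} steps requested: {used} passes, {32 // used} tokens per pass")
```

With 8 passes the decoder commits 4 tokens per pass. With 4 passes it commits 8, accepting lower-confidence guesses earlier, which is where quality can suffer in a real model.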

Three Sizes and Ready-to-Use Weights

NVIDIA released models with 3B, 8B, and 14B parameters. Each comes in two variants: base (trained on 1.3 trillion tokens) and instruction-tuned for chat. The training code and an integration with SGLang, a popular inference framework, are already open on GitHub.

What This Means

Diffusion language models are no longer confined to research labs; they are entering production. For developers, this means taking a single model and switching modes based on speed requirements: a slower, more accurate mode for critical tasks and a fast mode for bulk operations. For service providers, it is an opportunity to reduce inference costs and cut latency in user-facing responses.
