NVIDIA introduced Nemotron-Labs-Diffusion: a model with triple decoding

NVIDIA introduced Nemotron-Labs-Diffusion, a language model that supports three decoding modes: autoregressive, diffusion, and speculative. The key result is that the 8B model processes six times as many tokens per forward pass as Qwen3-8B. The model is available in three sizes (3B, 8B, 14B), each with base, instruct, and vision-language variants.

Hamidun News Editorial

AI monitoring · MarkTechPost

May 21, 2026 · 2 min

AI-processed from MarkTechPost; edited by Hamidun News

NVIDIA has introduced Nemotron-Labs-Diffusion, a new family of language models that combines three methods of text generation within a single architecture. The design targets the main bottleneck of modern LLMs: standard models generate text sequentially, one token after another, which limits processing speed and server throughput.

Three modes in one architecture

Nemotron-Labs-Diffusion supports three decoding modes in a single model. The first is autoregressive (AR), the classic approach used by ChatGPT-style models: the model looks at everything it has written so far and generates the next token. The second is diffusion-based parallel decoding, in which the model generates multiple tokens at once, as if "drawing" text from both sides.

The third is speculative: the model quickly predicts a block of tokens, then validates the predictions in a single pass. This hybrid approach allows selecting a mode to suit the task: speculative for real-time chat (fast), diffusion-based for batch document processing (parallel), and autoregressive for audit or verification (precise).

* Autoregressive: classic sequential generation, predictable
* Diffusion-based: parallel generation of multiple tokens at once
* Self-speculation: efficient fast prediction with single-pass verification

Six times faster on tokens

The results speak for themselves.

At the same model size (8B parameters), Nemotron-Labs-Diffusion processes six times as many tokens in a single forward pass as Qwen3-8B. That is a substantial difference. For commercial applications, it means either a lower cost per user served or more users on a single server.

Important: the figure refers to overall throughput, not to the response speed for a single message. In effect, a server can process six sequences in parallel instead of one.
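The single-pass verification described for the speculative mode is what lets one forward pass yield more than one token. The toy sketch below illustrates the generic draft-and-verify loop with greedy acceptance. It is not NVIDIA's implementation: `target_next` and `draft_block` are hypothetical stand-ins for the verifier and the fast drafter, the vocabulary size is arbitrary, and the drafting errors are injected on purpose so that some drafts get rejected.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption)


def target_next(prefix: List[int]) -> int:
    """Hypothetical verifier: the 'true' next token is a fixed function of the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_block(prefix: List[int], k: int) -> List[int]:
    """Hypothetical fast drafter: proposes k tokens, deliberately wrong
    whenever the context length is 4 modulo 5."""
    ctx, block = list(prefix), []
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 5 == 4:
            tok = (tok + 1) % VOCAB  # injected drafting error
        block.append(tok)
        ctx.append(tok)
    return block


def autoregressive_decode(prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """Baseline: one verifier pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq, n_new


def speculative_decode(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Draft k tokens, then verify them; returns (sequence, verification passes)."""
    seq, passes = list(prompt), 0
    while len(seq) - len(prompt) < n_new:
        draft = draft_block(seq, k)
        passes += 1  # a real model would score all k positions in one forward pass
        ctx, accepted = list(seq), []
        for tok in draft:
            expected = target_next(ctx)
            if tok != expected:
                accepted.append(expected)  # keep the verifier's token, drop the rest
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_next(ctx))  # whole draft accepted: one bonus token
        seq.extend(accepted)
    return seq[: len(prompt) + n_new], passes


if __name__ == "__main__":
    ar_seq, ar_passes = autoregressive_decode([1, 2, 3], 20)
    sp_seq, sp_passes = speculative_decode([1, 2, 3], 20, k=4)
    print("same output:", ar_seq == sp_seq)
    print("verifier passes:", ar_passes, "vs", sp_passes)
```

With greedy acceptance the speculative loop reproduces the sequential output exactly and only the number of verifier passes changes, which is the sense in which the figure above counts tokens per forward pass.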

Three sizes, three variants

NVIDIA has released Nemotron-Labs-Diffusion in three sizes: 3B (for edge devices and mobile), 8B (a balanced option), and 14B (for maximum quality and complexity). Each size is available in three variants: base, instruct (optimized for chatbots and instruction following), and vision-language (works with images and text). This means a company can take an 8B model with vision support and immediately have three generation modes plus multimodality.
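To make the combinations concrete, here is a minimal sketch of how a team might record the choice of size, variant, and decoding mode per workload. The sizes, variants, and the task-to-mode guidance come from the article. Everything else is an assumption made for illustration, including the workload labels, the `Deployment` type, and the `plan_deployment` helper. None of it reflects an actual NVIDIA API or checkpoint naming scheme.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import List, Tuple


class DecodingMode(Enum):
    AUTOREGRESSIVE = "autoregressive"
    DIFFUSION = "diffusion"
    SPECULATIVE = "speculative"


SIZES = ("3B", "8B", "14B")
VARIANTS = ("base", "instruct", "vision-language")

# Task-to-mode guidance as given in the article; the keys are hypothetical labels.
MODE_BY_WORKLOAD = {
    "realtime_chat": DecodingMode.SPECULATIVE,
    "batch_documents": DecodingMode.DIFFUSION,
    "audit": DecodingMode.AUTOREGRESSIVE,
}


@dataclass(frozen=True)
class Deployment:
    size: str
    variant: str
    mode: DecodingMode


def all_checkpoints() -> List[Tuple[str, str]]:
    """Every (size, variant) pair: three sizes times three variants."""
    return list(product(SIZES, VARIANTS))


def plan_deployment(size: str, variant: str, workload: str) -> Deployment:
    """Validate the requested combination and attach the suggested decoding mode."""
    if size not in SIZES:
        raise ValueError(f"unknown size: {size}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    if workload not in MODE_BY_WORKLOAD:
        raise ValueError(f"unknown workload: {workload}")
    return Deployment(size, variant, MODE_BY_WORKLOAD[workload])


if __name__ == "__main__":
    print(plan_deployment("8B", "vision-language", "realtime_chat"))
```

Keeping the mode separate from the size and variant mirrors the article's point that the decoding mode is a per-task choice on the same model rather than a different checkpoint.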

What this means

The pattern is clear: the field is moving from purely autoregressive models to hybrid architectures. Models that can generate many tokens in parallel, predict speculatively, and self-validate do not have to choose between speed and quality; they can optimize for both at once. Soon, purely autoregressive models may remain only for specialists who need absolute output stability.
