
NVIDIA introduced Nemotron-Labs-Diffusion: a model with triple decoding

NVIDIA introduced Nemotron-Labs-Diffusion, a language model that supports three decoding modes: autoregressive, diffusion, and speculative.

Source: MarkTechPost. Collage: Hamidun News.

NVIDIA has introduced Nemotron-Labs-Diffusion — a new family of language models that combines three methods of text generation within a single architecture. This engineering solution addresses the main bottleneck of modern LLMs: standard models generate text sequentially, one token after another, which limits processing speed and server throughput.

Three modes in one architecture

Nemotron-Labs-Diffusion supports three decoding modes in one model. The first is autoregressive (AR), the classic approach used by models such as ChatGPT: the model looks at everything it has written so far and generates the next token. The second is diffusion-based parallel decoding, in which the model generates multiple tokens at once, as if "drawing" the text from both sides.
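
The difference between these two modes can be illustrated by counting forward passes. The following is a toy sketch, not NVIDIA's implementation: `toy_forward` is a hypothetical stand-in for a model call that simply reads tokens from a fixed sentence, and the block size of 4 is an arbitrary assumption. Real diffusion decoding typically refines a block over several steps, so the toy captures only the accounting, not the algorithm.

```python
from typing import List, Tuple

# Fixed "ground truth" output; a real model would predict these tokens.
TARGET = "the quick brown fox jumps over the lazy dog".split()

def toy_forward(prefix: List[str], n_tokens: int) -> List[str]:
    """Hypothetical stand-in for one forward pass: returns the next n_tokens."""
    start = len(prefix)
    return TARGET[start:start + n_tokens]

def decode(tokens_per_pass: int) -> Tuple[List[str], int]:
    """Generate TARGET while counting how many forward passes were needed."""
    out: List[str] = []
    passes = 0
    while len(out) < len(TARGET):
        out.extend(toy_forward(out, tokens_per_pass))
        passes += 1
    return out, passes

ar_text, ar_passes = decode(tokens_per_pass=1)    # autoregressive: one token per pass
par_text, par_passes = decode(tokens_per_pass=4)  # block-parallel: assumed block of 4
print(ar_passes, par_passes)
```

Both calls produce the same nine-token sentence; the sequential loop needs nine passes, while the block-parallel loop needs three.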

The third is speculative, where the model quickly predicts a block of tokens, then validates the predictions in a single pass. This hybrid approach allows selecting a mode depending on the task: speculative for real-time chat (fast), diffusion-based for batch document processing (parallel), and autoregressive for audit or verification (precise).

* Autoregressive: classic sequential generation, predictable
* Diffusion-based: parallel generation of multiple tokens at once
* Self-speculation: efficient, fast prediction with single-pass verification

Six times faster on tokens

The results speak for themselves.

At the same model size (8B parameters), Nemotron-Labs-Diffusion processes 6 times more tokens in a single forward pass than Qwen3-8B. For commercial applications, this means either a lower cost per user served or more users on a single server.

Important: this figure describes overall throughput, not the response time for a single message. A server can process 6 sequences in parallel instead of one.
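
To make "tokens per forward pass" concrete, here is a toy draft-and-verify loop of the kind speculative decoding relies on. It is a minimal sketch under stated assumptions, not the published method: `draft_block` is a hypothetical drafter that guesses wrong at every fifth position, the block size of 6 is arbitrary, and only verification passes are counted (the cost of drafting is ignored).

```python
from typing import List, Tuple

# Token ids the verifier would produce if it decoded one token at a time.
TARGET = list(range(24))

def draft_block(prefix_len: int, k: int) -> List[int]:
    """Hypothetical drafter: guesses the next k tokens, wrong at every 5th position."""
    block = []
    for pos in range(prefix_len, min(prefix_len + k, len(TARGET))):
        guess = TARGET[pos] if pos % 5 != 4 else -1  # -1 marks a wrong guess
        block.append(guess)
    return block

def verify(prefix_len: int, block: List[int]) -> List[int]:
    """One verification pass: keep the longest correct prefix of the draft,
    then substitute the verifier's own token at the first mismatch."""
    accepted = []
    for offset, tok in enumerate(block):
        correct = TARGET[prefix_len + offset]
        if tok != correct:
            accepted.append(correct)
            break
        accepted.append(tok)
    return accepted

def speculative_decode(k: int) -> Tuple[List[int], int]:
    """Decode TARGET with draft blocks of size k; return tokens and pass count."""
    out: List[int] = []
    passes = 0
    while len(out) < len(TARGET):
        block = draft_block(len(out), k)
        out.extend(verify(len(out), block))
        passes += 1
    return out, passes

tokens, passes = speculative_decode(k=6)
print(passes, len(tokens) / passes)
```

Under these assumptions the toy needs 5 verification passes for 24 tokens, or 4.8 tokens per pass, and the output is identical to what token-by-token decoding would give. The ratio depends entirely on how often the draft is accepted, which is why a figure like this is a property of the model and workload rather than of the loop itself.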

Three sizes, three variants

NVIDIA has released Nemotron-Labs-Diffusion in three sizes: 3B (for edge devices and mobile), 8B (a balanced option), and 14B (for maximum quality and complexity). Each size is available in three variants: base, instruct (optimized for chatbots and instruction following), and vision-language (works with images and text). This means a company can take an 8B model with vision support and immediately have three generation modes plus multimodality.

What this means

The pattern is clear: the field is moving from pure autoregressive to hybrid architectures. Models that can generate many tokens in parallel, predict speculatively, and validate their own output do not have to choose between speed and quality; they can optimize both at once. Pure autoregressive models may soon remain only for specialists who need absolute output stability.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.