NVIDIA Released Nemotron-Labs-TwoTower — Diffusion Language Model with Open Weights


Hamidun News Editorial

AI monitoring · MarkTechPost

Jul 4, 2026· 2 min

AI-processed from MarkTechPost; edited by Hamidun News


NVIDIA published Nemotron-Labs-TwoTower on July 1, 2026. It is a diffusion-based language model with open weights, built on top of a frozen autoregressive backbone, Nemotron-3-Nano-30B-A3B. The main goal of the release is to remove the sequential-decoding bottleneck that limits the generation throughput of all autoregressive language systems. The model is distributed under the NVIDIA Nemotron Open Model License.

What limits autoregressive models

Autoregressive (AR) architectures are the foundation of GPT, Llama, Gemini, and most other LLMs. The principle is simple: each next token is predicted from all previous tokens, and the process is strictly sequential. This is elegant for training, but it creates a specific problem for production inference.

The next token cannot be computed until the previous one is finished. Adding GPU accelerators to the cluster does not lift this constraint. More hardware lets a provider serve more requests at once, but within a single response decoding stays sequential by definition, because the dependency is built into the computational graph. When generating long answers, the user waits proportionally longer, and the cost per token at scale hits a hard floor. For providers processing billions of requests per day, this is a direct and ongoing operational cost. This is why inference acceleration is one of the major research directions in the industry, alongside model size reduction and quantization.
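The sequential dependency described above can be made concrete with a minimal sketch. It assumes a toy stand-in for the model (`toy_next_token` is hypothetical and not part of any NVIDIA release); the point is only that producing N new tokens takes N forward passes, each waiting on the previous one.

```python
from typing import Callable, List, Tuple


def toy_next_token(prefix: List[int]) -> int:
    """Hypothetical stand-in for one model forward pass.

    The result depends on the whole prefix, as in a real AR model.
    """
    return (sum(prefix) + len(prefix)) % 50


def generate_autoregressive(
    prompt: List[int],
    n_new: int,
    next_token: Callable[[List[int]], int] = toy_next_token,
) -> Tuple[List[int], int]:
    """Generate n_new tokens one at a time and count the forward passes."""
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(n_new):
        # Step t cannot start before step t-1 has appended its token.
        tokens.append(next_token(tokens))
        forward_passes += 1
    return tokens, forward_passes
```

Calling `generate_autoregressive([1, 2, 3], 8)` performs eight forward passes, one per new token, and no amount of extra hardware lets the eighth pass start before the seventh has finished.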

Discrete diffusion language models offer an alternative mechanism: instead of decoding one token per step, they iteratively refine an entire output block over several steps. This opens the potential for generating multiple tokens in parallel in a single pass, and therefore a fundamentally different throughput profile.
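As a rough illustration of block-wise refinement, the sketch below starts from a fully masked block and commits the most confident positions after each parallel pass. It is a toy under stated assumptions: `toy_denoiser`, the `MASK` id, and the confidence-based unmasking schedule are all hypothetical, and the article does not describe the sampler that TwoTower actually uses.

```python
import math
from typing import List, Tuple

MASK = -1  # hypothetical id for a position that is not yet decided


def toy_denoiser(prompt: List[int], block: List[int]) -> List[Tuple[int, float]]:
    """Hypothetical stand-in for ONE parallel forward pass over the block.

    Returns a (token, confidence) guess for every position at once.
    """
    base = sum(prompt)
    return [((base + 7 * i) % 50, 1.0 / (1 + i)) for i in range(len(block))]


def generate_block(prompt: List[int], block_len: int, steps: int) -> Tuple[List[int], int]:
    """Fill a masked block in a few refinement passes and count the passes."""
    if block_len < 0 or steps < 1:
        raise ValueError("block_len must be >= 0 and steps must be >= 1")
    block = [MASK] * block_len
    per_step = math.ceil(block_len / steps)  # positions committed per pass
    passes = 0
    while MASK in block:
        preds = toy_denoiser(prompt, block)
        passes += 1
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        # Commit the most confident of the still-masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = preds[i][0]
    return block, passes
```

With `generate_block([1, 2, 3], block_len=8, steps=4)` the eight positions are filled in four passes instead of eight. The toy says nothing about output quality, which in real systems depends on how many tokens can safely be committed per step.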

How TwoTower architecture works

The two-tower construction combines AR and diffusion approaches in a single model:

AR backbone: pretrained Nemotron-3-Nano-30B-A3B (30 billion total parameters, with the A3B suffix indicating roughly 3 billion active per token; frozen)
Diffusion head: trainable component on top of the frozen backbone
Open weights: NVIDIA Nemotron Open Model License
Release date: July 1, 2026

Freezing the AR backbone is a principled architectural decision. Instead of training a diffusion model from scratch, NVIDIA uses the pretrained AR foundation as an immutable source of contextual language representations. Only the diffusion component is trained, which reduces computational costs for experimentation and decreases the data needed for adaptation. The choice of Nemotron-3-Nano-30B-A3B as the backbone also facilitates reproducibility: other teams can replicate the experiment using the same publicly available checkpoint.
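The idea of training only a small component on top of fixed representations can be shown with a deliberately tiny sketch. Everything here is an assumption made for illustration: `frozen_backbone` is a hypothetical three-feature function standing in for the pretrained model, and the "head" is a plain linear layer fitted with gradient descent, not NVIDIA's diffusion head.

```python
from typing import List, Sequence, Tuple


def frozen_backbone(x: float) -> List[float]:
    """Hypothetical frozen feature extractor.

    Its 'weights' are constants: nothing in this function is ever updated.
    """
    return [1.0, x, x * x]


def train_head(
    data: Sequence[Tuple[float, float]],
    lr: float = 0.05,
    epochs: int = 2000,
) -> List[float]:
    """Fit a linear head on frozen features with per-sample gradient descent."""
    w = [0.0, 0.0, 0.0]  # the only trainable parameters in this sketch
    for _ in range(epochs):
        for x, y in data:
            feats = frozen_backbone(x)  # forward through the frozen part
            pred = sum(wi * fi for wi, fi in zip(w, feats))
            err = pred - y
            # Gradient of 0.5 * err**2 with respect to each head weight.
            w = [wi - lr * err * fi for wi, fi in zip(w, feats)]
    return w


# Toy target y = 2 + 3x - x^2, sampled at five points.
samples = [(x, 2 + 3 * x - x * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
head_weights = train_head(samples)
```

Only three numbers are updated during training, while the feature extractor stays untouched. That is the same division of labor as a frozen backbone with a trainable head, at a scale small enough to read in one sitting.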

Why diffusion for text is a non-trivial task

Diffusion models have become the standard for image generation: Stable Diffusion, Midjourney, and DALL-E 3 operate on this principle. Adapting the approach to text is fundamentally harder, because pixel values live in a continuous numerical space while tokens are discrete. Gaussian noise cannot be added directly to discrete token IDs, so dedicated discrete diffusion processes are being developed for text.
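One common construction in the discrete diffusion literature replaces Gaussian noise with masking: at noise level t, each token is independently swapped for a special mask symbol with probability t. The sketch below shows that forward corruption step only. The `MASK_ID` value and the function name are hypothetical, and the article does not state which noising process TwoTower uses.

```python
import random
from typing import List

MASK_ID = -1  # hypothetical id for the special mask symbol


def mask_tokens(tokens: List[int], t: float, rng: random.Random) -> List[int]:
    """Forward corruption: mask each token independently with probability t.

    t = 0.0 leaves the sequence untouched; t = 1.0 masks every position.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
clean = [11, 42, 7, 19, 3, 28]
half_noised = mask_tokens(clean, 0.5, rng)  # some positions become MASK_ID
fully_noised = mask_tokens(clean, 1.0, rng)  # every position is MASK_ID
```

A denoising model is then trained to recover the original tokens at the masked positions, which plays the role that noise removal plays for images.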

This direction is actively developing but remains young by industry standards. Previous work, including MDLM and SEDD, demonstrated competitive results on language benchmarks, yet a quality gap with the best AR systems persisted. NVIDIA's two-tower approach is an attempt to resolve this trade-off: take the strong language representations of an already-trained AR model and add a diffusion mechanism to them without losing accumulated knowledge about syntax, semantics, and contextual dependencies.

Open weights are valuable in their own right for the academic community: researchers will be able to reproduce the architecture, measure real throughput gains on their own tasks, and propose improvements on top of the published checkpoint.

What it means

Nemotron-Labs-TwoTower is a practical step toward accelerating LLM inference without replacing hardware. NVIDIA, as a leading supplier of GPUs for the AI market, is interested in expanding the applicability of language models, including by reducing inference costs. If the hybrid AR+diffusion approach proves viable under real load, both in generation quality and in measured throughput gains, it could influence architectural decisions in the development of the next generation of language systems.
