NVIDIA Introduces Gated DeltaNet-2: Linear Attention with Separate Memory Gates

Q: What is the source?

Originally published on MarkTechPost. Hamidun News processes and adapts the material with AI.

Q: When was it published?

2026-05-25. Reading time: 4 min.



NVIDIA has introduced Gated DeltaNet-2, a new linear attention mechanism that improves memory management in large language models. The main difference is that erasing old data and writing new data are controlled separately, rather than by the single scalar gate used in earlier designs.

Problem with Memory in Linear Models

Linear attention mechanisms address a key limitation of transformers: they compress the ever-growing KV cache into a fixed-size recurrent state. This makes long texts cheaper to process and cuts memory consumption, which matters for practical applications and resource-constrained devices. The catch is that editing this memory without disrupting existing associations is difficult. The model has to learn new facts and preserve old knowledge at the same time. Add new information and you risk overwriting important associations; forget old information and you lose context. This is the classic conflict between learning and retention.
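To make the fixed-size state concrete, here is a minimal sketch of plain (ungated) linear attention written as a recurrence. The update S += v k^T and the helper names are illustrative assumptions, not code from NVIDIA or MarkTechPost.

```python
# Minimal sketch of ungated linear attention as a recurrence.
# Assumption: the state S has shape d_v x d_k and is updated as S += v k^T.

def outer(v, k):
    """Outer product v k^T as a list of rows (len(v) x len(k))."""
    return [[vi * kj for kj in k] for vi in v]

def matvec(S, q):
    """Matrix-vector product S q."""
    return [sum(sij * qj for sij, qj in zip(row, q)) for row in S]

def linear_attention(keys, values, queries):
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_k for _ in range(d_v)]  # fixed-size memory, independent of sequence length
    outputs = []
    for k, v, q in zip(keys, values, queries):
        update = outer(v, k)
        S = [[s + u for s, u in zip(s_row, u_row)] for s_row, u_row in zip(S, update)]
        outputs.append(matvec(S, q))  # read the memory with the query
    return outputs, S

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # the third key repeats the first
values = [[2.0], [5.0], [3.0]]
outputs, state = linear_attention(keys, values, keys)
print(outputs)  # [[2.0], [5.0], [5.0]]
print(state)    # [[5.0, 5.0]]
```

The state stays 1 x 2 however many tokens are processed. The third read returns 5.0 (2.0 + 3.0) rather than 3.0: a purely additive memory blends values written under the same key instead of replacing them, which is the editing problem described above.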

Previous models such as Gated DeltaNet and KDA used a single scalar gate to manage both processes at once: erasing old data and writing new data. This ties two competing goals to one lever, which cannot serve both well. As a result, model quality suffers and performance on complex tasks declines.
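The coupling can be illustrated with the classic delta rule, in which one scalar (called beta here) scales both the removal of the old value and the writing of the new one. This is a simplified sketch chosen for illustration; it omits the decay term that gated variants add, and it is not the exact formulation of Gated DeltaNet or KDA.

```python
# Sketch of a delta-rule update with a single scalar gate (assumed form):
#   S <- S - beta * (S k - v) k^T
# The same beta controls how much of the old value is erased and how much
# of the new value is written.

def delta_update(S, k, v, beta):
    old = [sum(sij * kj for sij, kj in zip(row, k)) for row in S]  # S k: value currently stored under k
    return [
        [sij - beta * (o - vi) * kj for sij, kj in zip(row, k)]
        for row, o, vi in zip(S, old, v)
    ]

S = [[2.0, 5.0]]   # d_v = 1, d_k = 2: value 2.0 under key [1, 0], value 5.0 under key [0, 1]
k = [1.0, 0.0]
v_new = [3.0]

results = {beta: delta_update(S, k, v_new, beta) for beta in (0.0, 0.5, 1.0)}
for beta, S_next in results.items():
    print(beta, S_next)
# 0.0 [[2.0, 5.0]]  -> nothing erased, nothing written
# 0.5 [[2.5, 5.0]]  -> half erased, half written
# 1.0 [[3.0, 5.0]]  -> fully erased, fully written
```

With one scalar there is no setting that fully erases the old value while writing only part of the new one, or the reverse.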

How Gated DeltaNet-2 Redesigned the Architecture

NVIDIA redesigned the memory management system. Instead of a single scalar gate, Gated DeltaNet-2 uses two independent channel-wise gates:

Erase gate b_t on the key axis: controls removal of outdated information
Write gate w_t on the value axis: controls addition of new data
Each gate operates channel-wise, with one value per channel rather than a single scalar for the whole memory
This lets the model balance forgetting and learning more flexibly
The model described has 1.3B parameters and was trained on 100B tokens
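The article does not give the exact update rule, so the following is only a hypothetical sketch of how two channel-wise gates could be decoupled. The assumed form is S <- S - (S k)(b ⊙ k)^T + (w ⊙ v) k^T, with b acting on the key axis and w on the value axis; the function name and the formula are assumptions, not NVIDIA's implementation.

```python
# Hypothetical decoupled update (assumed form, not taken from the source):
#   S <- S - (S k) (b * k)^T + (w * v) k^T
# erase_gate b has one entry per key channel, write_gate w one per value channel.
# With b = w = beta in every channel it reduces to the classic delta rule.

def decoupled_update(S, k, v, erase_gate, write_gate):
    old = [sum(sij * kj for sij, kj in zip(row, k)) for row in S]  # S k: stored value per value channel
    erase_dir = [b * kj for b, kj in zip(erase_gate, k)]           # b * k, on the key axis
    write_val = [w * vi for w, vi in zip(write_gate, v)]           # w * v, on the value axis
    return [
        [sij - o * ej + wv * kj for sij, ej, kj in zip(row, erase_dir, k)]
        for row, o, wv in zip(S, old, write_val)
    ]

S = [[2.0, 5.0], [4.0, 1.0]]  # d_v = 2, d_k = 2
k = [1.0, 0.0]
v = [3.0, 9.0]

# Erase everything stored under k, write nothing.
erased = decoupled_update(S, k, v, erase_gate=[1.0, 1.0], write_gate=[0.0, 0.0])
# Erase fully, but write the second value channel at half strength.
partial = decoupled_update(S, k, v, erase_gate=[1.0, 1.0], write_gate=[1.0, 0.5])

print(erased)   # [[0.0, 5.0], [0.0, 1.0]]
print(partial)  # [[3.0, 5.0], [4.5, 1.0]]
```

Both outcomes are out of reach for a single shared scalar: forgetting without writing, and writing different amounts into different value channels.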

This separation lets the model decide when to release old information and when to preserve or update existing associations in memory. Each memory channel can make its own decision, which makes the model more flexible across different types of data and tasks. As a result, the model can handle longer text sequences without loss of quality. Memory becomes not just a data store but a mechanism that selects what to forget and what to keep.

Impressive Results on Benchmarks

In the reported benchmarks, Gated DeltaNet-2 showed a noticeable advantage over competing architectures:

Outperformed Mamba-2 on standard language modeling tasks
Surpassed the original Gated DeltaNet and KDA in overall performance
Showed better results than Mamba-3 on long-context tasks
Showed its most impressive improvements on RULER S-NIAH (needle-in-a-haystack retrieval)
Showed a practically significant improvement on multi-key needle retrieval
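For readers unfamiliar with needle-in-a-haystack tests, the sketch below builds a toy single-needle prompt and a toy multi-key prompt. The filler text, the sentence templates and the function names are illustrative assumptions; they are not RULER's actual templates or scoring code.

```python
# Toy needle-in-a-haystack generator (illustrative only, not the RULER harness).

FILLER = "The grass is green. The sky is blue. The sun is yellow."

def make_niah_example(num_lines, needles, query_key):
    """Build a haystack prompt.

    needles maps key -> (value, line_index); query_key selects which needle is asked about.
    """
    lines = [FILLER] * num_lines
    for key, (value, index) in needles.items():
        lines[index] = f"The special magic number for {key} is {value}."
    question = f"What is the special magic number for {query_key}?"
    expected = needles[query_key][0]
    return "\n".join(lines) + "\n" + question, expected

def is_correct(model_answer, expected):
    """Score by substring match on the expected value."""
    return expected in model_answer

# Single needle: one fact hidden in the middle of the filler.
single_prompt, single_expected = make_niah_example(1000, {"apple": ("7481", 500)}, "apple")

# Multi-key: several needles with different keys, only one of which is queried.
multi_prompt, multi_expected = make_niah_example(
    1000,
    {"apple": ("7481", 100), "river": ("2093", 500), "cloud": ("6650", 900)},
    "river",
)
```

In the multi-key case the model has to return the value bound to the queried key and ignore the others, which puts direct pressure on how cleanly a fixed-size memory keeps associations apart.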

Particularly noteworthy are the results on commonsense reasoning tasks, which test not just language modeling but an understanding of relationships between concepts. Separate memory management improves not only computation speed but also the handling of logical connections, a sign that architectural decisions have a deep effect on model quality.

What This Means

Gated DeltaNet-2 demonstrates an important principle: the effectiveness of linear attention depends less on linearity itself than on the architectural details of its implementation. When the erase and write functions are properly separated, the system becomes both faster and more capable. In practice, this means models could process documents of hundreds of thousands of tokens without loss of quality. This opens up applications that require long context, from intelligent search over large text collections to dialogue systems that need to remember the entire conversation history.
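A back-of-the-envelope calculation shows why a fixed-size state matters at this scale. The layer count, head count and head dimension below are hypothetical round numbers chosen for illustration; they are not the configuration of the 1.3B model discussed in the article.

```python
# Rough memory comparison: growing KV cache vs. fixed recurrent state.
# All model dimensions are hypothetical.

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    # Keys and values (factor 2) are stored for every token, layer and head.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value

def recurrent_state_bytes(layers, heads, d_k, d_v, bytes_per_value=2):
    # One d_k x d_v state matrix per layer and head, independent of sequence length.
    return layers * heads * d_k * d_v * bytes_per_value

tokens = 200_000
kv = kv_cache_bytes(tokens, layers=24, kv_heads=16, head_dim=128)
state = recurrent_state_bytes(layers=24, heads=16, d_k=128, d_v=128)

print(kv)           # 39321600000 bytes, about 39.3 GB
print(state)        # 12582912 bytes, about 12.6 MB
print(kv // state)  # 3125
```

Under these assumptions the ratio is 2 × tokens / head_dim, so the gap keeps widening as the context grows while the recurrent state stays the same size.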
