NVIDIA Introduces Gated DeltaNet-2: Linear Attention with Separate Memory Gates
AI-processed from MarkTechPost; edited by Hamidun News
NVIDIA has introduced a new linear attention mechanism called Gated DeltaNet-2, which changes how large language models manage their recurrent memory. The main difference is that erasing old data and writing new data are controlled separately, rather than by the single scalar gate used in previous generations.
Problem with Memory in Linear Models
Linear attention mechanisms address a central problem with transformers: instead of keeping a KV cache that grows with every token, they compress the history into a fixed-size recurrent state. This makes long texts cheaper to process and reduces memory consumption, which matters for practical applications and for devices with limited resources. There is a catch, however: editing that memory without disrupting existing associations is difficult. The model has to absorb new facts while preserving old ones. Add new information and you risk overwriting important associations; forget the old and you lose context. This is the classic conflict between learning and retention.
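To make the fixed-state idea concrete, here is a minimal sketch in plain Python. It assumes the simplest additive form of linear attention, where the state is updated as S <- S + v k^T and read as S q; the function names, dimensions, and toy inputs are illustrative choices, not taken from NVIDIA's work.

```python
def linear_attention_step(S, k, v):
    """Additive linear-attention update: S <- S + v k^T (S has shape d_v x d_k)."""
    return [[S[i][j] + v[i] * k[j] for j in range(len(k))] for i in range(len(v))]


def read(S, q):
    """Query the memory: o = S q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in S]


d_k, d_v = 4, 3
S = [[0.0] * d_k for _ in range(d_v)]  # recurrent state, fixed size
kv_cache = []                          # what a standard transformer would keep

for t in range(1000):
    # Toy deterministic keys and values, only there to drive the loop.
    k = [((t + j) % 5) / 5.0 for j in range(d_k)]
    v = [((2 * t + i) % 7) / 7.0 for i in range(d_v)]
    S = linear_attention_step(S, k, v)
    kv_cache.append((k, v))

state_size = len(S) * len(S[0])          # numbers stored in the recurrent state
cache_size = len(kv_cache) * (d_k + d_v) # numbers stored in the KV cache
print(state_size, cache_size)
```

After 1,000 steps the recurrent state still holds d_v x d_k numbers, while the cache has grown linearly with the sequence. The price is that every token has to share the same small matrix, which is where the memory-editing problem comes from.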
Previous models such as Gated DeltaNet and KDA used a single scalar gate to control both processes at once: erasing old data and writing new data. This builds in a conflict, because one lever cannot serve two opposing goals efficiently. As a result, model quality suffers and performance on complex tasks declines.
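The coupling can be illustrated with a simplified delta-rule update. This sketch assumes the standard form S <- S + beta * (v - S k) k^T with a unit-norm key and leaves out the decay term that gated variants add; `delta_step` and `beta` are illustrative names, and the code is not the implementation of any of the models named above.

```python
def matvec(S, x):
    """Compute S x for a matrix stored as a list of rows."""
    return [sum(s * xj for s, xj in zip(row, x)) for row in S]


def delta_step(S, k, v, beta):
    """Delta rule with one scalar strength: S <- S + beta * (v - S k) k^T."""
    pred = matvec(S, k)  # value the memory currently returns for key k
    return [
        [S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(len(k))]
        for i in range(len(v))
    ]


k = [1.0, 0.0]            # unit-norm key
v_old = [1.0, 2.0]
v_new = [3.0, 4.0]

S = [[0.0, 0.0], [0.0, 0.0]]
S = delta_step(S, k, v_old, beta=1.0)   # store v_old under k
S = delta_step(S, k, v_new, beta=0.5)   # update the same key at half strength

mixed = matvec(S, k)
print(mixed)  # [2.0, 3.0], i.e. 0.5 * v_old + 0.5 * v_new
```

With a single `beta`, the fraction of the old value that is erased always equals the fraction of the new value that is written. Erasing everything while writing only a little, or the reverse, cannot be expressed.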
How Gated DeltaNet-2 Redesigned the Architecture
NVIDIA decided to radically redesign the memory management system. Instead of a single scalar gate, Gated DeltaNet-2 uses two independent channel-wise gates:
- Erase gate b_t on the key axis — manages deletion of outdated information
- Write gate w_t on the value axis — controls addition of new data
- Each gate operates at the channel level (channel-wise), not as a single scalar for all memory
- This allows the model to balance more flexibly between forgetting and learning
- The evaluated model has 1.3B parameters and was trained on 100B tokens
This separation lets the model decide independently when to release old information and when to preserve or update existing associations in memory. Because each memory channel makes its own decision, the model can adapt more flexibly to different kinds of data and tasks. As a result, it can handle longer text sequences with less loss of quality. Memory stops being a passive data store and becomes a learned mechanism for choosing what to forget and what to keep.
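One way to picture the decoupling is the sketch below. The update form S <- S - (S k)(b * k)^T + (w * v) k^T is an assumption chosen for illustration: it applies an erase gate `b` per key channel and a write gate `w` per value channel, and it collapses to the scalar delta rule when both gates equal the same constant. It is not the published Gated DeltaNet-2 equation, and all function names are hypothetical.

```python
def matvec(S, x):
    """Compute S x for a matrix stored as a list of rows."""
    return [sum(s * xj for s, xj in zip(row, x)) for row in S]


def decoupled_step(S, k, v, b, w):
    """Assumed update: S <- S - (S k)(b * k)^T + (w * v) k^T."""
    pred = matvec(S, k)                            # current memory read-out for k
    erase_dir = [bj * kj for bj, kj in zip(b, k)]  # erase gate on the key axis
    write_val = [wi * vi for wi, vi in zip(w, v)]  # write gate on the value axis
    return [
        [S[i][j] - pred[i] * erase_dir[j] + write_val[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]


def scalar_delta_step(S, k, v, beta):
    """Reference: delta rule with one scalar gate for both erase and write."""
    pred = matvec(S, k)
    return [
        [S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(len(k))]
        for i in range(len(v))
    ]


k = [1.0, 0.0]
S0 = [[0.0, 0.0], [0.0, 0.0]]
S1 = decoupled_step(S0, k, [1.0, 2.0], b=[1.0, 1.0], w=[1.0, 1.0])  # store [1, 2]

# Erase the old value completely, but write only value channel 0 of the new one.
S2 = decoupled_step(S1, k, [3.0, 4.0], b=[1.0, 1.0], w=[1.0, 0.0])
selective = matvec(S2, k)
print(selective)  # [3.0, 0.0]

# With both gates set to the same constant, the scalar rule is recovered.
tied = decoupled_step(S1, k, [3.0, 4.0], b=[0.5, 0.5], w=[0.5, 0.5])
reference = scalar_delta_step(S1, k, [3.0, 4.0], beta=0.5)
```

The second update fully clears what was stored under `k` while writing only one value channel, a combination that a single shared scalar cannot produce.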
Impressive Results on Benchmarks
In the reported benchmarks, Gated DeltaNet-2 showed a noticeable advantage over competing architectures:
- Outperformed Mamba-2 on standard language modeling tasks
- Surpassed the original Gated DeltaNet and KDA in overall performance
- Showed better results than Mamba-3 on long context tasks
- Showed its largest improvements on RULER S-NIAH (single needle-in-a-haystack retrieval)
- Showed substantial improvement on multi-key needle retrieval
The results on commonsense reasoning tasks are particularly noteworthy. These tasks test not just language modeling but understanding of the relationships between concepts. The gains suggest that separating erase and write control improves more than memory efficiency: it also improves how well the model captures logical connections, a sign that architectural decisions have a deep influence on model capability.
What This Means
Gated DeltaNet-2 demonstrates an important principle: the effectiveness of linear attention depends less on the idea of linearity itself than on the architectural details of its implementation. When the two functions, erasing and writing, are separated cleanly, the system can stay efficient while producing better results. In practice, this could let models process documents of hundreds of thousands of tokens with far less loss of quality. That opens new possibilities for applications that require long context, from intelligent search over large text collections to complex dialogue systems that need to remember the entire conversation history.