Moonshot AI presented Attention Residuals — an alternative to residual connections in transformers

AI-processed from MarkTechPost; edited by Hamidun News

Moonshot AI has released Attention Residuals, an architectural change for transformers that targets one of the model's most fundamental elements: residual connections. Instead of summing the outputs of all previous layers with fixed weights, the team proposes attention over depth, which lets the model decide which earlier representations it actually needs.

Where the Bottleneck Is

In most modern LLMs, each layer does not replace its input but adds its output to a shared hidden state. This scheme, inherited from residual networks and especially common in PreNorm architectures, makes it possible to train deep networks without gradients vanishing. But it comes at a cost: all previous representations are mixed with equal weight, and the contribution of any single layer becomes diluted as depth grows.

As depth increases, the magnitude of the hidden state grows approximately linearly, which makes early signals increasingly hard to recover in a useful form. Moonshot AI calls this the depth-wise aggregation problem: models have learned to select important tokens along the sequence and to route tokens to experts in MoE layers, but across network depth they still rely on fixed summation. The researchers draw a direct analogy with the RNN era, when the whole sequence was likewise compressed into a single state until attention gave each token access to all previous steps.
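
The fixed-sum rule described above can be made concrete with a toy residual stream. This is a minimal sketch, not Moonshot AI's code: the helper names are hypothetical, and each "layer output" is just a random unit vector. Because these toy updates are uncorrelated, the norm grows more slowly than the roughly linear growth reported for real networks; the point is only that every layer enters the sum with weight 1 and the stream keeps getting larger.

```python
import math
import random


def l2_norm(vec):
    """Euclidean length of a vector stored as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def random_unit_vector(dim, rng):
    """Hypothetical stand-in for one layer's output: a random direction of length 1."""
    raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = l2_norm(raw)
    return [x / scale for x in raw]


def residual_stream_norms(num_layers, dim, seed=0):
    """Apply the fixed rule h = h + f(h) and record the norm of h after each layer."""
    rng = random.Random(seed)
    hidden = random_unit_vector(dim, rng)  # plays the role of the token embedding
    norms = [l2_norm(hidden)]
    for _ in range(num_layers):
        update = random_unit_vector(dim, rng)
        # Every layer's output is added with weight 1; nothing is ever down-weighted.
        hidden = [h + u for h, u in zip(hidden, update)]
        norms.append(l2_norm(hidden))
    return norms


norms = residual_stream_norms(num_layers=48, dim=64)
print(f"norm after embedding: {norms[0]:.2f}, after 48 layers: {norms[-1]:.2f}")
```

The embedding starts with norm 1, while the final state is the unweighted sum of 49 vectors, so the embedding's share of the total shrinks as layers are added.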

Here, they propose to do almost the same thing, only across layers instead of time.

How AttnRes Works

In Attention Residuals, each layer receives not a plain sum of all previous outputs but a weighted combination of earlier representations, with the weights produced by softmax attention. The weights depend on both the layer and the input, so the network can amplify useful signals and suppress noise rather than inherit everything equally. The practical variant keeps the mechanism very lightweight: one learnable pseudo-query vector per layer.
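
A minimal sketch of that mixing step, under stated assumptions: the function and parameter names are hypothetical, the earlier states are used directly as both keys and values, and the scores are scaled by the square root of the dimension as in standard attention. The article does not specify these details, so this illustrates the idea rather than the published implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attn_res_input(prev_states, pseudo_query):
    """Mix earlier representations with softmax weights instead of a plain sum.

    prev_states: one vector per earlier representation (embedding and layer outputs).
    pseudo_query: the current layer's learnable vector (hypothetical name).
    """
    dim = len(pseudo_query)
    # Score each earlier state against this layer's query (scaling is an assumption).
    scores = [
        sum(q * s for q, s in zip(pseudo_query, state)) / math.sqrt(dim)
        for state in prev_states
    ]
    weights = softmax(scores)
    # Weighted combination: states that match the query contribute more.
    mixed = [
        sum(w * state[i] for w, state in zip(weights, prev_states))
        for i in range(dim)
    ]
    return mixed, weights


states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = attn_res_input(states, pseudo_query=[4.0, 0.0])
uniform_mixed, uniform_weights = attn_res_input(states, pseudo_query=[0.0, 0.0])
```

With the query `[4.0, 0.0]`, the second state, which has no component along the query, gets the smallest weight. A zero query reduces the mix to a plain average of the three states.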

As a result, the idea reads less like a radical restructuring of the transformer and more like a relatively compact replacement for the familiar residual scheme. The full AttnRes version has to store all previous states, so for large models Moonshot AI proposes Block AttnRes. Layers are grouped into blocks; ordinary accumulation is kept inside each block, and attention is applied across the blocks' summary representations.
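
The block variant can be sketched in the same toy style. The assumptions are spelled out here because the article gives no implementation details: the names are hypothetical, a block's summary is taken to be the plain sum of its layer outputs, only completed blocks are covered, and scores are scaled by the square root of the dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_summaries(layer_outputs, block_size):
    """Inside each block, keep ordinary accumulation: sum the layer outputs."""
    dim = len(layer_outputs[0])
    summaries = []
    for start in range(0, len(layer_outputs), block_size):
        block = layer_outputs[start:start + block_size]
        summaries.append([sum(vec[i] for vec in block) for i in range(dim)])
    return summaries


def attend_over_blocks(summaries, pseudo_query):
    """Apply softmax attention across block summaries rather than across all layers."""
    dim = len(pseudo_query)
    scores = [
        sum(q * s for q, s in zip(pseudo_query, summary)) / math.sqrt(dim)
        for summary in summaries
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * summary[i] for w, summary in zip(weights, summaries))
        for i in range(dim)
    ]
    return mixed, weights


# Eight toy layer outputs grouped into four blocks of two layers each.
layer_outputs = [[float(i), 1.0] for i in range(8)]
summaries = block_summaries(layer_outputs, block_size=2)
mixed, weights = attend_over_blocks(summaries, pseudo_query=[0.0, 0.0])
```

Only the four summaries have to be kept for the attention step, not all eight layer outputs, which is where the memory saving comes from.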

According to the team, a configuration with roughly eight blocks preserves most of the full version's gains, cuts memory and communication costs from O(Ld) to O(Nd), where L is the number of layers, N the number of blocks, and d the hidden dimension, and keeps the added inference latency below 2%.
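
To put the O(Ld) versus O(Nd) difference in numbers, here is a small worked example. The layer count and hidden size are hypothetical values chosen for illustration; only the figure of roughly eight blocks comes from the article.

```python
def stored_values_per_token(num_states, hidden_dim):
    """Number of values kept per token when num_states vectors of size hidden_dim are stored."""
    return num_states * hidden_dim


num_layers = 64    # hypothetical L
num_blocks = 8     # N, "roughly eight blocks" per the article
hidden_dim = 4096  # hypothetical d

full_attn_res = stored_values_per_token(num_layers, hidden_dim)   # grows with L * d
block_attn_res = stored_values_per_token(num_blocks, hidden_dim)  # grows with N * d
reduction = full_attn_res / block_attn_res

print(f"full: {full_attn_res}, block: {block_attn_res}, reduction: {reduction:.0f}x")
```

Under these assumed sizes the stored state shrinks by a factor of L / N, which is 8 here, and the factor does not depend on d.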

What the Tests Showed

Moonshot AI tested the approach not only in scaling-law experiments but also on a large Kimi Linear model with 48 billion total parameters, of which 3 billion are active, pretrained on 1.4 trillion tokens. The key claim: Block AttnRes reaches the same loss as a baseline trained with 1.25 times more compute. In other words, this is not cosmetic tuning but potentially more favorable scaling.

  • GPQA-Diamond: 36.9 → 44.4
  • HumanEval: 59.1 → 62.2
  • MMLU: 73.5 → 74.6
  • C-Eval: 79.6 → 82.5
  • Inference latency overhead: below 2%
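
The absolute gains implied by the list above can be computed directly. The scores are the ones quoted in the article; reading each pair as "baseline → Block AttnRes" is an assumption based on the surrounding text.

```python
# (before, after) pairs as quoted in the article; interpreted as baseline vs. AttnRes.
benchmark_scores = {
    "GPQA-Diamond": (36.9, 44.4),
    "HumanEval": (59.1, 62.2),
    "MMLU": (73.5, 74.6),
    "C-Eval": (79.6, 82.5),
}


def absolute_gains(scores):
    """Difference in points between the second and first score of each benchmark."""
    return {name: round(after - before, 1) for name, (before, after) in scores.items()}


gains = absolute_gains(benchmark_scores)
largest = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The largest jump is on GPQA-Diamond, a reasoning-heavy benchmark, while the general-knowledge MMLU score moves the least.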

The training dynamics matter just as much. In the report, the team notes that AttnRes mitigates the PreNorm dilution effect: hidden-state magnitudes stay more uniform across depth, and gradient norms are distributed more evenly across layers. In practice, this means more controllable training and a lower chance that part of the model's depth turns into expensive but barely useful ballast. The most notable gains came in multi-step reasoning and code generation, which makes the work especially relevant for future LLMs and agentic systems.

What This Means

This is not a new chatbot or user feature, but an attempt to rewrite one of the basic building blocks of transformers. If Moonshot AI's results are confirmed on other architectures and in production stacks, the race for LLM quality will increasingly be driven not only by more data and GPUs but also by smarter internal mechanics in the models themselves.