The geometry of attention: how QK Norm teaches the model to understand meaning

QK Norm in transformers is more than a stabilization trick. Normalizing queries and keys turns their dot product into cosine similarity, so the network has to express meaning through the angles between vectors rather than their magnitudes. The claimed result: better handling of rare words and less 'attention sink'.

Khamidun Zhemal

AI monitoring · Habr AI

May 17, 2026· 3 min

AI-processed from Habr AI; edited by Hamidun News

QK Norm, the normalization of query and key vectors before the dot product in the attention mechanism, is often treated as a technical detail that exists for numerical stability. The argument here is that it is also a geometric constraint: it forces the transformer to encode relevance in the direction of vectors rather than in their length.

Problem without normalization

Network layers are lazy. Without normalization, instead of rotating vectors so that angular relationships carry the meaning, the transformer takes the easy path and increases magnitude: an important token simply becomes louder. It does this because it can. The dot product q⃗ · k⃗ = |q⃗| |k⃗| cos(θ) grows with both the angle term and the magnitudes, so the network learns to use both: it adjusts the angle and it inflates the vectors. Inflating is cheaper than understanding. The article links this to "attention sink", where grammatical function tokens (commas, articles, pronouns) begin to dominate attention because they occur frequently and accumulate large magnitude without carrying much meaning.
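The effect of magnitude on unnormalized attention can be shown with a small numeric sketch. The vectors, the factor of 5, and the helper names (`dot`, `softmax`, `attention_weights`) are illustrative assumptions, not taken from the source; the only thing the example relies on is the standard scaled dot-product followed by softmax.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Softmax over raw (unnormalized) dot products, scaled by sqrt(d)."""
    d = len(query)
    return softmax([dot(query, k) / math.sqrt(d) for k in keys])

query = [1.0, 0.0]
key_aligned = [1.0, 0.0]   # same direction as the query, cos = 1.0
key_offset = [0.6, 0.8]    # unit length, cos = 0.6 with the query

# Both keys have unit length: the better-aligned key gets more weight.
baseline = attention_weights(query, [key_aligned, key_offset])

# Inflate only the magnitude of the off-angle key; its direction is unchanged.
loud_key = [5.0 * x for x in key_offset]
inflated = attention_weights(query, [key_aligned, loud_key])

print("baseline:", baseline)
print("inflated:", inflated)
```

In this toy setup the off-angle key receives the smaller weight at unit length and the larger weight once it is scaled up, even though the angle to the query never changes.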

How QK Norm works

Normalization is a constraint. When query and key are normalized to unit length, their dot product is exactly the cosine of the angle between them. The vectors are "locked" onto the unit hypersphere of the head's embedding space. The network can no longer inflate a vector to attract attention. Only one option is left: rotate the vector so that its angle to other vectors expresses the required semantic relationship. If two words need to be connected, that has to show up as an angle, not as loudness. This changes how internal representations work:

Meaning is encoded by the angles between vectors, not by their magnitudes
All query and key vectors lie at the same distance from the origin
Rare words are not drowned out, because on the hypersphere no token can gain weight through magnitude alone
Attention is based on semantic similarity, not token frequency
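A minimal sketch of the normalized variant, assuming plain L2 normalization to unit length as described above. The `scale` parameter is an assumption added for illustration: cosine logits are confined to [-1, 1], so some temperature is needed for the softmax to become sharp. Practical implementations may instead use a normalization layer with learnable gain, which the source does not describe.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v, eps=1e-12):
    # Project the vector onto the unit hypersphere.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def qk_norm_weights(query, keys, scale=1.0):
    """Attention weights from cosine similarity between query and keys.

    `scale` is a hypothetical temperature, not part of the source article.
    """
    q = l2_normalize(query)
    logits = [scale * dot(q, l2_normalize(k)) for k in keys]
    return softmax(logits)

query = [1.0, 0.0]
key_aligned = [1.0, 0.0]               # cos = 1.0 with the query
key_offset = [0.6, 0.8]                # cos = 0.6 with the query
loud_key = [5.0 * x for x in key_offset]  # same direction, 5x the length

weights_unit = qk_norm_weights(query, [key_aligned, key_offset], scale=10.0)
weights_loud = qk_norm_weights(query, [key_aligned, loud_key], scale=10.0)

print("unit-length key:", weights_unit)
print("inflated key:   ", weights_loud)
```

Here the two results agree up to floating-point error: scaling a key no longer changes its share of attention, and the better-aligned key wins in both cases.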

Correct place in architecture

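One critical detail: QK Norm should come before RoPE (Rotary Position Embedding), not after. The order is: normalize query and key, then apply RoPE. RoPE is a pure rotation, so it preserves the unit length set by the normalization and only adds position-dependent angles. The source states that normalizing after RoPE blurs the positional information. One way this can happen is when the normalization includes a learnable per-dimension scale: applied after the rotation, it rescales the two coordinates of each rotated pair differently, and the attention score then no longer depends on relative position alone.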
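The ordering can be sketched as follows, under stated assumptions: plain L2 normalization stands in for QK Norm, and `rope` rotates consecutive coordinate pairs with the commonly used base of 10000. The function names and test vectors are hypothetical, and real implementations often pair coordinates differently (for example, first half with second half).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def rope(v, position, base=10000.0):
    """Rotate consecutive coordinate pairs by position-dependent angles."""
    d = len(v)  # assumed to be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = v[i], v[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def score(query, key, q_pos, k_pos):
    # Order from the text: normalize first, then apply RoPE.
    q = rope(l2_normalize(query), q_pos)
    k = rope(l2_normalize(key), k_pos)
    return dot(q, k)

query = [0.5, -1.0, 2.0, 0.3]
key = [1.5, 0.2, -0.7, 1.1]

# The rotation keeps the normalized vector on the unit hypersphere.
rotated_norm = math.sqrt(dot(rope(l2_normalize(query), 7),
                             rope(l2_normalize(query), 7)))

# Same relative offset (2 positions) at two different absolute positions.
score_near = score(query, key, 3, 1)
score_far = score(query, key, 10, 8)

print(rotated_norm, score_near, score_far)
```

With this order the rotated vectors stay at unit length, the score stays within the cosine range, and two token pairs with the same relative offset get the same score regardless of absolute position.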

What it means

This is not just an engineering trick for numerical stability. It redefines what "attention" means at the geometric level. A network that relies on angles instead of magnitudes should generalize better on rare tokens and be less prone to domination by function words. Without empirical results on billion-parameter models, this remains a theoretical argument, but the geometric logic is sound.
