The geometry of attention: how QK Norm teaches the model to understand meaning
QK Norm in transformers is not just stabilization. Normalization turns the dot product into cosine similarity, forcing the network to express meaning through th

QK Norm — normalization of query and key before the dot product in the attention mechanism — is often perceived as a technical detail for numerical stability. In reality, it is a deep geometric constraint that forces the transformer to express meaning in a completely different way.
Problem without normalization
Network layers are lazy. Without normalization, instead of cleverly rotating vectors and expressing meaning through angular relationships, the transformer takes the simple path: it increases magnitude. An important token simply becomes louder. It does this because it can. The dot product q⃗ · k⃗ = |q⃗| |k⃗| cos(θ) grows through both angle and magnitude. The network uses both: it adjusts the angle and it inflates the vector. And inflating is cheaper than understanding. This contributes to an "attention sink" effect, where grammatical function tokens (commas, articles, pronouns) begin to dominate attention: they occur frequently, so their vectors can accumulate large magnitude regardless of how relevant they are.
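A tiny numeric illustration of the decomposition above, with made-up 2D vectors (the names `k_aligned` and `k_loud` are purely illustrative): a key that points almost exactly along the query loses to a key that is 45 degrees off but ten times longer.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0.0])

# Illustrative keys: one well aligned with q, one merely "loud".
k_aligned = np.array([0.9, 0.1])         # small norm, nearly parallel to q
k_loud = 10.0 * np.array([0.5, 0.5])     # 45 degrees from q, large norm

logit_aligned = float(q @ k_aligned)     # 0.9
logit_loud = float(q @ k_loud)           # 5.0

cos_aligned = cosine(q, k_aligned)       # about 0.99
cos_loud = cosine(q, k_loud)             # about 0.71

print(f"raw logits: aligned={logit_aligned:.2f}, loud={logit_loud:.2f}")
print(f"cosines:    aligned={cos_aligned:.2f}, loud={cos_loud:.2f}")
```

The raw dot product ranks the loud key first, while the cosine ranks the aligned key first.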
How QK Norm works
Normalization is a constraint. When we normalize query and key to unit length, the dot product becomes cosine similarity: with |q⃗| = |k⃗| = 1, the product q⃗ · k⃗ reduces to cos(θ). The vectors are "locked" onto the surface of a unit hypersphere. Now the network cannot inflate a vector to attract attention. There is only one option left: rotate the vector so that its angle with other vectors expresses the needed semantic relationship. If you need a connection between words, show it with an angle, not loudness. This fundamentally changes how internal representations work:
- Meaning is encoded by angles between vectors, not their magnitude
- All query and key vectors are at the same distance from the origin
- Rare words are not drowned out, because on the hypersphere every vector has the same length
- Attention is based on semantic similarity, not frequency
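The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: `qk_norm_weights` and its fixed `scale` are assumptions made for the example. A temperature of some kind is needed because cosines lie in [-1, 1], which on its own would make the softmax nearly flat; real models often make this scale learnable or use RMSNorm or LayerNorm with learned gains instead of plain L2 normalization.

```python
import numpy as np

def l2_normalize(x, eps=1e-6):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def qk_norm_weights(q, k, scale=10.0):
    """Attention weights from cosine similarity.

    `scale` is a hypothetical fixed temperature chosen for this sketch.
    """
    return softmax(scale * l2_normalize(q) @ l2_normalize(k).T)

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))   # 3 queries, head dimension 8
k = rng.normal(size=(4, 8))   # 4 keys

# Make the first key 100x "louder" without changing its direction.
loudness = np.array([[100.0], [1.0], [1.0], [1.0]])
k_loud = k * loudness

weights = qk_norm_weights(q, k)
weights_loud = qk_norm_weights(q, k_loud)

# Without normalization the same change reshapes the attention pattern.
raw = softmax(q @ k.T)
raw_loud = softmax(q @ k_loud.T)

print("QK Norm unchanged by loudness:", np.allclose(weights, weights_loud, atol=1e-4))
print("raw attention unchanged:      ", np.allclose(raw, raw_loud, atol=1e-4))
```

With normalization, inflating a key changes nothing because only its direction enters the score. Without it, the inflated key's logits are multiplied by 100.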
Correct place in architecture
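One critical detail: QK Norm should come before RoPE (Rotary Position Embedding), not after. RoPE encodes position by rotating pairs of dimensions, and a rotation does not change a vector's length, so pure unit-length normalization gives the same result in either order. The order matters once the norm carries a learnable per-dimension scale, as RMSNorm and LayerNorm typically do: scaling the rotated dimensions unevenly breaks the property that the attention score depends only on the relative position of two tokens, so positional information becomes blurred and is not encoded correctly. Order: normalize query and key → then RoPE.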
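The ordering can be checked numerically. The sketch below is an illustration under stated assumptions: `rope` uses the interleaved-pair convention with base 10000 (some implementations pair dimensions differently), and `gain` is a made-up per-dimension scale standing in for the learned weights of a norm layer. It compares the score of two token pairs that share the same relative offset (2) but sit at different absolute positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by pos * freq."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def l2_normalize(x, eps=1e-6):
    return x / (np.linalg.norm(x) + eps)

def score(q, k, q_pos, k_pos, gain, norm_first):
    """Attention logit with the norm (plus gain) applied before or after RoPE."""
    if norm_first:
        q2 = rope(gain * l2_normalize(q), q_pos)
        k2 = rope(gain * l2_normalize(k), k_pos)
    else:
        q2 = gain * l2_normalize(rope(q, q_pos))
        k2 = gain * l2_normalize(rope(k, k_pos))
    return float(q2 @ k2)

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)
gain = np.array([2.0, 0.5, 1.0, 1.5])   # hypothetical learned per-dimension scale

for norm_first in (True, False):
    near = score(q, k, 5, 3, gain, norm_first)      # positions 5 and 3
    far = score(q, k, 105, 103, gain, norm_first)   # same offset, shifted by 100
    label = "norm -> RoPE" if norm_first else "RoPE -> norm"
    print(f"{label}: near={near:.4f}, far={far:.4f}")
```

With the norm applied first, the two scores should agree up to floating-point error, because RoPE is the last operation and the score depends only on the offset. With the gain applied after the rotation, the scores generally differ, which is the sense in which position gets blurred.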
What it means
This is not just an engineering trick for numerical stability: it is a redefinition of what "attention" means at the geometric level. A network that relies on angles instead of magnitudes should generalize better on rare tokens and be less easily dominated by function words. Without empirical results on billion-parameter models, this remains a theoretical argument, but the geometric logic is sound.