Habr published a brief guide to attention: self-attention, cross-attention, and multi-head
AI-processed from Habr AI; edited by Hamidun News
Habr published a brief breakdown of the attention mechanism, the basic idea that transformers and modern LLMs are built on. Without unnecessary padding, the material shows how a model decides which tokens to weight more heavily, then illustrates it with a simple numerical example and PyTorch code.
How attention works
The author starts with a basic definition: attention lets a neural network avoid treating all parts of the input equally and instead decide dynamically what matters most for the current task. To do this, the input sequence is transformed into three sets of representations: Query, Key, and Value. The model then compares each token's query with the keys of all tokens in the sequence (including its own), obtains importance weights, and uses them to assemble a new contextual vector. This is the main trick: the representation of a word or element no longer depends only on itself, but on its entire context. The material walks through the computation chain step by step:
- the Q, K, and V matrices are built from the embeddings
- token similarity is then computed as the dot product of queries and keys
- the result is scaled by dividing by the square root of the key dimensionality, sqrt(d_k)
- softmax turns the scaled scores into attention weights
- the final output is the sum of the rows of V, weighted by those attention weights
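The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's own code (which uses PyTorch): the helper names `matmul`, `softmax`, and `self_attention`, the input `X`, and the projection matrices `W_Q`, `W_K`, `W_V` are all hypothetical values chosen only to make the chain visible.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    # Step 1: build Q, K, V from the same input embeddings.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    # Step 2: similarity of every query with every key (Q K^T).
    scores = matmul(q, [list(col) for col in zip(*k)])
    # Step 3: scale by sqrt(d_k), where d_k is the key dimensionality.
    scale = math.sqrt(len(k[0]))
    # Step 4: softmax per row gives the attention weights.
    weights = [softmax([s / scale for s in row]) for row in scores]
    # Step 5: each output row is a weighted sum of the rows of V.
    return matmul(weights, v), weights

# Hypothetical input: 3 tokens with 2-dimensional embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical projection matrices (identity for Q and K keeps the scores easy to follow).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[1.0, 0.0], [0.0, 1.0]]
W_V = [[0.5, 0.0], [0.0, 2.0]]

output, weights = self_attention(X, W_Q, W_K, W_V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and the output has one row per input token with the same width as V.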
The article also explains why the division by sqrt(d_k) is needed. As vector dimensionality grows, dot products grow in magnitude, softmax saturates quickly, and gradients begin to vanish. Scaling keeps the computation in a more stable range and makes training more predictable. For beginners this is a useful emphasis: most explanations present the formula as a given, while here the text shows which problem it actually solves.
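The saturation effect is easy to see on a small example. The sketch below uses a hypothetical query and three hypothetical keys with d_k = 64, and compares softmax over the raw scores with softmax over the scaled ones. As a rough proxy for gradient size it also prints p * (1 - p), the derivative of each softmax output with respect to its own score.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
# Hypothetical query and keys; the keys agree with the query to different degrees.
q = [0.5] * d_k
keys = [[0.5] * d_k, [0.25] * d_k, [0.0] * d_k]

raw = [sum(a * b for a, b in zip(q, k)) for k in keys]  # [16.0, 8.0, 0.0]
scaled = [s / math.sqrt(d_k) for s in raw]              # [2.0, 1.0, 0.0]

unscaled_weights = softmax(raw)    # roughly [0.9997, 0.0003, 0.0000]: saturated
scaled_weights = softmax(scaled)   # roughly [0.665, 0.245, 0.090]: still informative

# d p_i / d s_i = p_i * (1 - p_i); values near zero mean almost no gradient signal.
unscaled_sensitivity = [p * (1 - p) for p in unscaled_weights]
scaled_sensitivity = [p * (1 - p) for p in scaled_weights]

print([round(p, 4) for p in unscaled_weights], [round(g, 4) for g in unscaled_sensitivity])
print([round(p, 4) for p in scaled_weights], [round(g, 4) for g in scaled_sensitivity])
```

Without scaling, nearly all the weight lands on one key and the sensitivities collapse toward zero; after dividing by sqrt(d_k), the distribution stays spread out enough for gradients to flow.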
Example with tokens
The clearest part of the text is a toy example with the phrase "Karina goes to the store." The author reduces the task to four tokens and two-dimensional embeddings to avoid drowning in matrices, then walks through every step by hand: tokenization, adding positional information, building the X matrix, computing QK^T, scaling, softmax, and the final multiplication by V. As a result, attention stops looking like magic from a formula and becomes an ordinary sequence of vector operations.
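The first steps of that walkthrough (tokenization, positional information, the X matrix) might look roughly like this. Everything concrete here is an assumption made for illustration: the four-token split, the embedding values, and the choice of sinusoidal positional encoding are not taken from the article, whose own numbers are not reproduced.

```python
import math

# Hypothetical four-token split of the example phrase.
tokens = ["Karina", "goes", "to", "store"]

# Hypothetical 2-dimensional embeddings (made-up values).
embedding_table = {
    "Karina": [0.9, 0.1],
    "goes": [0.2, 0.8],
    "to": [0.1, 0.1],
    "store": [0.7, 0.6],
}

def positional_encoding(position, d_model):
    """Sinusoidal encoding (an assumed scheme): sin on even dimensions, cos on odd ones."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

# X has one row per token: embedding plus positional information.
X = []
for position, token in enumerate(tokens):
    emb = embedding_table[token]
    pos = positional_encoding(position, len(emb))
    X.append([e + p for e, p in zip(emb, pos)])

for token, row in zip(tokens, X):
    print(token, [round(v, 3) for v in row])
```

The resulting 4 x 2 matrix X is what gets projected into Q, K, and V in the remaining steps.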
After normalization, you can see how attention is distributed. For the token "Karina," the model in the example assigns about 31% of its attention to the token itself, 15% to "goes," and the rest to the other tokens in the sentence. At the output, the original embeddings have been transformed into new, contextual representations.
This is an important point for understanding transformers: the model does not store one fixed meaning for a word, but reassembles its representation anew in each context.
"Each vector after self-attention no longer describes the word by itself."
Other types of attention
In the second half of the article, the author moves to two extensions of the basic scheme. Cross-attention is described as a mode in which Query is taken from one sequence, while Key and Value are taken from another. In practice, it is convenient to think of this as the mechanism that lets a decoder consult the encoder's context.
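A minimal cross-attention sketch in plain Python, assuming hypothetical decoder and encoder states and identity projection matrices (the article's own version is written in PyTorch). The only structural difference from self-attention is where Q versus K and V come from.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_x, encoder_x, w_q, w_k, w_v):
    q = matmul(decoder_x, w_q)  # queries come from the decoder sequence
    k = matmul(encoder_x, w_k)  # keys come from the encoder sequence
    v = matmul(encoder_x, w_v)  # values come from the encoder sequence
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

# Hypothetical states: 2 target-side tokens attend over 3 source-side tokens.
decoder_states = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projections

context, weights = cross_attention(decoder_states, encoder_states,
                                   identity, identity, identity)
print(len(weights), "x", len(weights[0]))  # one row per decoder token, one column per encoder token
```

Note that the weight matrix is no longer square: its shape is (decoder length) x (encoder length), while the context has one row per decoder token.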
The formula hardly changes, but the data source does: the model compares the current query not with its own sequence, but with an external context. This is a key building block for translation models, multimodal systems, and many encoder-decoder architectures.
Next, the article covers multi-head attention. Instead of a single attention computation, the model runs several "heads" in parallel, and each can learn to look at the sequence from its own angle: one may pick up local connections, another long-range dependencies, a third syntax or semantic roles. The head outputs are then concatenated and passed through one more linear transformation. The article also includes minimal PyTorch implementations of all three variants (self-attention, cross-attention, and multi-head self-attention), so the text works not only as theory but also as a starting cheat sheet for practice.
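The split, attend, concatenate, project pattern of multi-head attention can be sketched in plain Python as follows. This is an illustrative sketch rather than the article's PyTorch code: the function names, the input `X`, and the identity projection matrices are hypothetical, and splitting the projected columns evenly across heads is one common convention assumed here.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0]) // num_heads  # assumes the width divides evenly
    head_outputs = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)  # this head's slice of columns
        head_outputs.append(attention([r[cols] for r in q],
                                      [r[cols] for r in k],
                                      [r[cols] for r in v]))
    # Concatenate the heads token by token, then apply the output projection.
    concatenated = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    return matmul(concatenated, w_o)

# Hypothetical input: 3 tokens, model width 4, split into 2 heads of width 2.
X = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.9], [1.0, 1.0, 0.1, 0.4]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

output = multi_head_self_attention(X, identity, identity, identity, identity, num_heads=2)
print(len(output), "x", len(output[0]))
```

Because each head sees only its own slice of Q, K, and V, the heads can produce different attention patterns over the same sequence; the final projection `w_o` mixes them back into one vector per token.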
What it means
For those just getting into transformers, this is a solid introduction: it doesn't overload the reader with proofs, but walks honestly through the math, examples, and code. For practitioners, it is a reminder that behind the "magic" of LLMs stand quite concrete operations on weights, matrices, and context, and that understanding them is useful if you work with models as more than just a user.