Habr published a brief guide to attention: self-attention, cross-attention, and multi-head


Hamidun News Editorial

AI monitoring · Habr AI

May 2, 2026· 3 min

AI-processed from Habr AI; edited by Hamidun News


Habr has published a short breakdown of the attention mechanism, the core idea behind transformers and modern LLMs. Without padding, the material shows how a model decides which tokens to weight more heavily, then illustrates this with a simple numerical example and PyTorch code.

How attention works

The author starts with a basic definition: attention lets a neural network avoid treating all parts of the input equally and instead decide dynamically what matters most for the current task. To do this, the input sequence is projected into three sets of representations: Query, Key, and Value. The model then compares each token's query with the keys of all tokens, including its own, obtains importance weights, and uses them to assemble a new contextual vector. This is the main trick: a token's representation no longer depends only on the token itself but on its entire surroundings. The material walks through the computation chain step by step:

embeddings are used to build Q, K, and V matrices
then token similarity is calculated through dot product
the result is scaled by dividing by the square root of dimensionality
after softmax, attention weights are obtained
the final output is a weighted sum of the rows of the V matrix
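The steps listed above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the article's PyTorch code, so it runs without dependencies; the embeddings and the projection matrices w_q, w_k, and w_v are made-up values, not numbers from the article.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                           # similarity via dot product
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]                 # each row sums to 1
    return matmul(weights, v), weights                         # weighted sum of rows of V

# Hypothetical embeddings (3 tokens, 2 dimensions) and projection matrices.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[0.5, 0.5], [-0.5, 0.5]]
w_v = [[1.0, 2.0], [0.0, 1.0]]

q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
output, weights = attention(q, k, v)
```

Each row of weights belongs to one token and says how much of every token's value vector goes into that token's new representation.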

The article separately explains why division by sqrt(d_k) is needed. As vector dimensionality grows, dot products grow in magnitude, softmax quickly saturates, and gradients begin to vanish. Scaling keeps the scores in a more stable range and makes training more predictable. For beginners this is a useful emphasis: most explanations present the formula as a given, whereas here the reader sees what problem it actually solves.
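The saturation effect is easy to observe on random vectors. The sketch below is an illustration under assumed settings (d_k = 512, four keys, unit-variance Gaussian components, a fixed seed); none of these values come from the article.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
d_k = 512  # assumed dimensionality, chosen only to make the effect visible

query = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(4)]

# Raw dot products: their spread grows roughly like sqrt(d_k).
raw_scores = [sum(a * b for a, b in zip(query, key)) for key in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

# Diagonal of the softmax Jacobian, p * (1 - p): close to zero when p is near 0 or 1.
raw_sensitivity = [p * (1 - p) for p in raw_weights]
scaled_sensitivity = [p * (1 - p) for p in scaled_weights]

print("raw weights:   ", [round(w, 3) for w in raw_weights])
print("scaled weights:", [round(w, 3) for w in scaled_weights])
```

With unscaled scores the largest weight is usually close to 1 and the sensitivities close to 0, which is the saturation the article describes. Dividing by sqrt(d_k) can only flatten the distribution, never sharpen it.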

Example with tokens

The clearest part of the text is a toy example with the phrase "Karina goes to the store." The author reduces the task to four tokens and two-dimensional embeddings to avoid drowning in matrices, then works through every step by hand: tokenization, adding positional information, building the X matrix, computing QK^T, scaling, softmax, and the final multiplication by V. As a result, attention stops looking like magic from a formula and becomes an ordinary sequence of vector operations.

After normalization, the distribution of attention becomes visible. For the token "Karina," the model in the example takes about 31% of its information from the word itself, 15% from the word "goes," and the rest from the other tokens in the sentence. On output, the original embeddings are transformed into new, contextual representations.
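A toy walk-through of this kind can be reproduced in plain Python. The token split, embeddings, positional terms, and projection matrices below are all invented for illustration, so the percentages this sketch prints will not match the article's 31% and 15%.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["Karina", "goes", "to", "store"]                      # assumed split
embeddings = [[1.0, 0.2], [0.3, 0.9], [0.1, 0.1], [0.8, 0.7]]   # made-up 2-d vectors
positions = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.0, 0.2]]    # made-up positional terms

# X = embeddings + positional information
x = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(embeddings, positions)]

# Hypothetical projection matrices.
w_q = [[1.0, 0.5], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.5, 1.0]]
w_v = [[0.5, 1.0], [1.0, 0.5]]
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

d_k = len(k[0])
scores = [[sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
          for q_row in q]
weights = [softmax(row) for row in scores]
context = matmul(weights, v)  # new contextual vector for every token

# Read each row as "how much this token takes from every token".
for token, row in zip(tokens, weights):
    shares = ", ".join(f"{other}: {100 * w:.0f}%" for other, w in zip(tokens, row))
    print(f"{token} <- {shares}")
```

Every printed row sums to roughly 100%, which is what makes a statement like "31% from itself" meaningful.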

This is an important point for understanding transformers: the model does not rely on one fixed representation of a word, but rebuilds it in each context.

"Each vector after self-attention no longer describes the word by itself."

Other types of attention

In the second half of the article, the author moves on to two extensions of the basic scheme. Cross-attention is described as a mode in which Query comes from one sequence, while Key and Value come from another. In practice, it is convenient to think of it as the mechanism that lets a decoder consult the encoder's context.
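The difference from self-attention shows up mainly in the shapes. The following plain-Python sketch assumes two hypothetical decoder states and three hypothetical encoder states; the function name cross_attention and all numbers are illustrative, not taken from the article's PyTorch code.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries_from, context, w_q, w_k, w_v):
    """Query comes from one sequence; Key and Value come from another."""
    q = matmul(queries_from, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    d_k = len(k[0])
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
              for q_row in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

# Made-up states: 2 decoder tokens attend over 3 encoder tokens.
decoder_states = [[0.2, 0.9], [1.0, 0.1]]
encoder_states = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[0.8, 0.2], [0.1, 0.9]]
w_v = [[1.0, 1.0], [0.0, 2.0]]

output, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
```

The weight matrix has one row per decoder token and one column per encoder token, and the output has as many rows as there are queries. Passing the same sequence for both arguments gives ordinary self-attention.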

The formula barely changes, but the data source does: the model compares the current query not with its own sequence but with an external context. This is a key building block for machine translation, multimodal systems, and many encoder-decoder architectures. Next, the article turns to multi-head attention.

Instead of a single attention operation, the model runs several "heads" in parallel, and each learns to look at the sequence from its own angle: one may be better at catching local connections, another at long-range dependencies, a third at syntax or semantic roles. The head outputs are then concatenated and passed through one more linear transformation. The article also includes minimal PyTorch implementations of all three variants (self-attention, cross-attention, and multi-head self-attention), so the text works not only as theory but also as a starting cheat sheet for practice.
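The split-attend-concatenate-project pattern can be sketched without any framework. This is not the article's PyTorch implementation. It assumes the common layout in which each head works on its own slice of the projected Q, K, and V columns, with random weights and hypothetical sizes (3 tokens, model width 4, 2 heads).

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_k = len(k[0])
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
              for q_row in q]
    return matmul([softmax(row) for row in scores], v)

def split_heads(m, num_heads):
    """Slice the columns of m into num_heads equal blocks."""
    head_dim = len(m[0]) // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in m] for h in range(num_heads)]

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    heads = [attention(q_h, k_h, v_h)
             for q_h, k_h, v_h in zip(split_heads(q, num_heads),
                                      split_heads(k, num_heads),
                                      split_heads(v, num_heads))]
    # Concatenate the head outputs token by token, then apply the output projection.
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    return matmul(concat, w_o)

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
seq_len, d_model, num_heads = 3, 4, 2   # hypothetical sizes
x = random_matrix(seq_len, d_model)
w_q, w_k, w_v, w_o = (random_matrix(d_model, d_model) for _ in range(4))

out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

The output keeps the input shape (one vector of width d_model per token). Because each head computes its softmax over its own slice, the heads can form different attention patterns over the same sequence.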

What it means

For those just getting into transformers, this is a solid introduction: it does not overload the reader with proofs, but walks honestly through the math, examples, and code. For practitioners, it is a reminder that behind the "magic" of LLMs stand quite concrete operations on weights, matrices, and context, and understanding them is useful if you work with models as more than just a user.

