How a dot product from an algebra textbook became the foundation of ChatGPT, Claude, and Gemini
AI-processed from Habr AI; edited by Hamidun News
In 2017, eight researchers working at Google published the paper "Attention Is All You Need", and it quietly divided the history of artificial intelligence into "before" and "after". Without fanfare or humanoid robots, they described the transformer, an architecture that today underlies ChatGPT, Claude, Gemini, Midjourney, and virtually all modern generative AI. Most remarkably, the foundation of this revolution turned out to be one of the simplest operations in linear algebra: the dot product of two vectors.
The Wall That Neural Networks Hit
Before transformers, recurrent networks (RNNs and LSTMs) dominated text processing. They read sentences sequentially: word by word, step by step. The problem was that by the end of a long text, the model "forgot" the beginning: the signal faded as it passed through hundreds of intermediate steps. During training the same effect shows up as the vanishing gradient problem, which LSTMs eased but did not eliminate. Imagine a model that reads a novel and by the fifth chapter has already forgotten the main character's name. This was exactly the wall neural networks hit by the mid-2010s. The architecture scaled poorly, and computation could not be parallelized across a sequence, because each step depended on the previous one. Something fundamentally different was needed.
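The fading can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that every recurrent step keeps a fixed fraction of the signal. The factor 0.9 is a hypothetical value, not a measurement from any real model; in an actual RNN the factor changes from step to step and depends on the trained weights.

```python
def surviving_signal(per_step_factor: float, steps: int) -> float:
    """Fraction of a signal left after being rescaled once per step."""
    return per_step_factor ** steps


# Hypothetical factor: each step preserves 90% of what it received.
for steps in (10, 50, 100, 500):
    remaining = surviving_signal(0.9, steps)
    print(f"{steps:>3} steps: {remaining:.1e} of the signal remains")
```

Even with 90% preserved at every step, less than 0.01% of the original signal survives 100 steps, which is why information from the start of a long text was so hard to retain.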
Why the Dot Product is Genius in Its Simplicity
The dot product is an operation from a standard vector algebra course. You take two vectors, multiply their coordinates pairwise, and add all the results. The output is one number: for vectors of comparable length, the larger it is, the more "similar" or "related" the vectors are to each other. In the self-attention mechanism of a transformer, this operation is applied as follows:
- Each token is transformed by learned weight matrices into three vectors: a Query, a Key, and a Value
- The dot product of one token's Query with the Key of every token in the sequence, itself included, shows the "strength of connection" between a pair of tokens
- The results are divided by the square root of the Key dimension and normalized through softmax, giving attention weights between 0 and 1 that sum to 1
- The final vector of a token is a weighted sum of all Values according to these weights
Essentially, each token simultaneously asks all others: "How important are you for my understanding right now?" — and receives a precise numerical answer. This happens in parallel for the entire sentence, not sequentially word by word.
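The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names are made up for this example, and the Query, Key, and Value vectors are hand-written toy numbers, whereas a real transformer derives them from token embeddings with learned weight matrices.

```python
import math


def dot(a, b):
    """Multiply coordinates pairwise and add the results."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, weight_rows = [], []
    for q in queries:
        # One dot product per (query, key) pair, divided by sqrt(d_k).
        scores = [dot(q, k) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of all Value vectors, one coordinate at a time.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        weight_rows.append(weights)
    return outputs, weight_rows


# The dot product on its own: aligned, perpendicular, and opposed vectors.
print(dot([1.0, 2.0], [2.0, 4.0]))    # 10.0
print(dot([1.0, 2.0], [-2.0, 1.0]))   # 0.0
print(dot([1.0, 2.0], [-1.0, -2.0]))  # -5.0

# Hand-written toy Q, K, V for three tokens (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = self_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row holds one token's attention weights over all three tokens. The loop over queries is written out for readability; every row is independent of the others, so in practice the whole table is computed as a single matrix multiplication.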
Three Properties That Changed the Industry
The dot product turned out to be the ideal operation for language for several reasons at once.
Parallelism. All attention computations for a sequence can be performed simultaneously, unlike in RNNs, where each step depends on the previous one. This let training make full use of GPUs and TPUs and made it possible to scale models to hundreds of billions of parameters. This is how BERT, GPT-3, and then GPT-4 and Claude appeared over the course of several years.
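The difference in data dependencies can be shown with a small sketch. Everything here is a toy assumption: the one-unit RNN, its weights `w_hidden` and `w_input`, and the input vectors are invented for illustration. The Python code itself still runs serially; the point is only that attention scores can be filled in any order, which is the property that lets hardware compute them all at once.

```python
import math


def rnn_states(inputs, w_hidden=0.5, w_input=1.0):
    """Toy one-unit RNN: every state needs the previous state first."""
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_hidden * state + w_input * x)
        states.append(state)
    return states


def attention_scores(queries, keys, order):
    """Fill the score table in whatever order the caller requests."""
    table = [[None] * len(keys) for _ in queries]
    for i, j in order:
        # Each cell reads only queries[i] and keys[j], never another cell.
        table[i][j] = sum(q * k for q, k in zip(queries[i], keys[j]))
    return table


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]

cells = [(i, j) for i in range(len(Q)) for j in range(len(K))]
forward = attention_scores(Q, K, cells)
backward = attention_scores(Q, K, list(reversed(cells)))

print(forward == backward)            # True: the order does not matter
print(rnn_states([1.0, 0.5, -1.0]))   # computed strictly left to right
```

The RNN loop cannot be reordered, because each new state is a function of the one before it.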
Global context without fading. Each word immediately "sees" all others within the model's context window, regardless of distance in the text. The pronoun "he" at the end of a long paragraph can link directly to the character's name from the very beginning, with no signal fading through intermediate steps.
Interpretability. Attention matrices can be visualized: you can literally see which word pays attention to which when processing a sentence. This is rare in the world of neural networks, where most decisions remain a black box.
The architecture is "based solely on attention mechanisms, dispensing with recurrence and convolutions entirely," wrote the authors in 2017.
For the community at the time, this sounded like heresy. Three years later it became an axiom.
What This Means
The transformer proved that a revolution in AI can come not from neurobiology or the philosophy of consciousness, but from a second-year linear algebra textbook. ChatGPT, Claude, Gemini, and Midjourney all, at their core, multiply matrices, which amounts to computing dot products billions of times a second. The simplicity of the operation turned out to be its main strength: what changed everything was not added complexity but the right choice of an elementary tool.