Models

Transformer

A Transformer is a neural network architecture centered on self-attention, which lets every position in a sequence attend directly to every other position. This allows training to be parallelized across sequence positions and supports effective modeling of long-range dependencies in language, images, and other sequences.

The Transformer architecture was introduced by Vaswani et al. at Google in the 2017 paper 'Attention Is All You Need.' Its central innovation is multi-head self-attention: for each token in an input sequence, the model computes a weighted sum of representations from all tokens in the sequence (the token itself included), where the weights (attention scores) reflect learned relevance between token pairs. Running multiple attention heads in parallel allows the model to capture different types of relationships simultaneously. Because attention is computed over the entire sequence at once rather than step-by-step, training is parallelizable across the sequence length, in contrast to recurrent networks (RNNs, LSTMs) that must process tokens sequentially. Autoregressive generation at inference time still produces output tokens one at a time.
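The weighted sum described above can be sketched for a single attention head in plain Python. This is a minimal illustration of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, and not a production implementation. As a simplifying assumption, the token vectors are used directly as queries, keys, and values; a real Transformer first applies learned linear projections, and a multi-head layer runs several such heads on different projections and concatenates the results. The function names and the toy input are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors.

    Returns (outputs, weights): outputs[i] is the weighted sum of all value
    vectors for token i, and weights[i][j] is how much token i attends to
    token j.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every token (including this one) to the query token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input: three tokens with 2-dimensional representations (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over the whole sequence. The rows do not depend on each other, which is why the computation can run for all positions at once during training.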

A standard Transformer encoder or decoder consists of stacked identical layers, each containing a multi-head self-attention sublayer, a position-wise feed-forward network, residual (skip) connections around each sublayer, and layer normalization. Decoder layers in the original encoder-decoder design also include a cross-attention sublayer over the encoder output. Since self-attention by itself is insensitive to token order, positional encodings (fixed sinusoidal or learned embeddings) are added to token representations to inject sequence order. The key architectural hyperparameters (number of layers, hidden dimension, number of attention heads, and feed-forward width), together with training data volume and compute budget, largely determine model capability.
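The fixed sinusoidal option mentioned above can be sketched as follows. This follows the formulation in the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The helper names and the zero-valued placeholder embeddings are assumptions made for illustration.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed sinusoidal encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency; even uses sin, odd uses cos.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(token_embeddings):
    """Add the positional encoding to each token embedding, element by element."""
    seq_len, d_model = len(token_embeddings), len(token_embeddings[0])
    table = sinusoidal_positional_encoding(seq_len, d_model)
    return [
        [e + p for e, p in zip(emb, pos_row)]
        for emb, pos_row in zip(token_embeddings, table)
    ]


# Placeholder embeddings: two tokens, d_model = 4 (assumed values).
embeddings = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
with_positions = add_positions(embeddings)
```

With identical input embeddings, the two output rows differ only through the positional term, which is what lets later attention layers distinguish token order.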

Transformers became dominant in NLP with BERT (Google, 2018) and GPT-2 (OpenAI, 2019), then demonstrated strong transfer to computer vision with the Vision Transformer (ViT, Google, 2020), and to structural biology with AlphaFold2's use of attention over amino-acid sequences and pair representations (DeepMind, 2020). As of 2026, every major large language model family (GPT-4, Claude 3, Gemini 1.5, Llama 3, Mistral) is a Transformer variant, as are leading vision encoders and multimodal models combining text, images, audio, and video.

The main practical limitation of standard Transformer self-attention is that its computational and memory cost grows quadratically with sequence length, making very long contexts expensive. By 2026, multiple techniques address this: sparse attention patterns, linear attention approximations, sliding-window attention (used in Mistral 7B), and hardware-optimized exact attention (FlashAttention, which keeps the computation quadratic but avoids materializing the full attention matrix in memory). Context windows exceeding one million tokens are supported by several production models. Research on architectural successors, including state-space models such as Mamba and hybrid attention/SSM architectures, aims to combine the modeling quality of Transformers with sub-quadratic scaling.
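To make the scaling difference concrete, the sketch below counts how many query-key pairs are scored under full attention versus a causal sliding window, and builds the corresponding mask. It is a counting illustration only, not an attention implementation. The function names, the causal form of the window, and the sizes in the usage lines are assumptions chosen for the example.

```python
def full_attention_pairs(seq_len):
    """Query-key pairs scored by full self-attention: one per ordered pair."""
    return seq_len * seq_len


def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j.

    Assumes a causal window: token i sees itself and the window - 1 tokens
    immediately before it.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def sliding_window_pairs(seq_len, window):
    """Pairs scored under the causal window: at most `window` per token."""
    return sum(min(i + 1, window) for i in range(seq_len))


# Illustrative sizes only: doubling the length quadruples the full-attention
# count but roughly doubles the windowed count.
for n in (1_000, 2_000, 4_000):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=256))
```

The windowed count grows linearly in sequence length for a fixed window. The trade-off is that distant tokens can influence each other only indirectly, through information passed across stacked layers.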

Example

When a user submits a 50,000-token technical document to Claude for summarization, the underlying Transformer reads the entire document in a single pass, with attention letting tokens draw directly on tokens anywhere else in the document, and then generates the summary token by token. The result is a coherent summary without the information loss that truncating the input or summarizing it chunk by chunk would cause.
