Models

Decoder-Only Architecture

A decoder-only architecture is a transformer variant that uses a single self-attention stack with causal (left-to-right) masking to predict each next token from the preceding context, with no separate encoder. It is the dominant design for large language models.

In a decoder-only transformer, the entire input, both the prompt and the generated output, is treated as one token sequence. Each layer applies masked self-attention, in which each position can attend only to itself and to earlier positions (causal masking), so the model cannot observe future tokens when predicting the next one. This contrasts with encoder–decoder designs, which pair a bidirectional encoder with a causal decoder, and with BERT-style encoder-only models, which apply bidirectional attention and are trained to fill in masked tokens rather than to generate text left to right.
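
The masking rule can be made concrete with a minimal sketch in plain Python. The helper names causal_mask and masked_softmax are illustrative and not taken from any library, and the uniform scores stand in for real query-key products. The sketch assumes the usual convention that a position may attend to itself as well as to earlier positions.

```python
import math

def causal_mask(n):
    """Return an n x n matrix; mask[i][j] is True when position i may attend to j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    out = []
    for row, allowed in zip(scores, mask):
        peak = max(s for s, ok in zip(row, allowed) if ok)  # for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform raw scores (an assumption for illustration): only the mask shapes the result.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Each row is the attention distribution of one position. The zeros above the diagonal are what prevent a position from seeing tokens that come after it.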

The architecture is trained with a single objective: given a sequence of tokens, predict the next token at every position. This autoregressive language modeling task is self-supervised: it needs no labeled data, only raw text, which allows training on web-scale corpora. The GPT series popularized the design. GPT-1 (OpenAI, 2018) demonstrated that generative pretraining transfers to downstream tasks, GPT-2 (2019) showed that scale produced surprisingly coherent generation, and GPT-3 (2020, 175 billion parameters) established that very large decoder-only models can learn tasks in context, from examples in the prompt, without any gradient updates.
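
The next-token objective can be sketched as follows. The input at each position is a token, the target is the token one step later, and the loss is the mean negative log-probability assigned to the targets. The token IDs, the vocabulary size of 8, and the uniform "model" are assumptions made purely for illustration; a real model would produce a different distribution at each position.

```python
import math

def cross_entropy(probs_per_position, targets):
    """Mean negative log-likelihood of the target token at each position."""
    losses = [-math.log(probs[t]) for probs, t in zip(probs_per_position, targets)]
    return sum(losses) / len(losses)

tokens = [5, 2, 7, 1]        # a toy sequence of token IDs (hypothetical)
inputs = tokens[:-1]         # [5, 2, 7]: what the model sees
targets = tokens[1:]         # [2, 7, 1]: the same sequence shifted by one

# Stand-in for model output: a uniform distribution over 8 tokens at every position.
vocab_size = 8
uniform = [[1.0 / vocab_size] * vocab_size for _ in inputs]

loss = cross_entropy(uniform, targets)
print(round(loss, 4))  # 2.0794, i.e. log(8), the loss of a model that knows nothing
```

Because the targets are just the inputs shifted by one position, any raw text supplies its own labels, which is why no annotation is needed.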

Decoder-only models dominate modern LLMs because the design is simpler (one stack, one objective), scales predictably with parameter count and data, and handles both understanding and generation with the same network: a task is specified through the prompt, and the answer is produced as a continuation. As of 2026, the major frontier models, including GPT-4 and GPT-4o (OpenAI), the Claude 3 and Claude 4 series (Anthropic), Gemini 1.5 and 2.0 (Google DeepMind), Llama 3 (Meta), and Mistral, are reported or widely understood to use decoder-only architectures, although the developers of closed models publish few architectural details.

Despite its prevalence, the decoder-only design has limitations. Causal masking means each token attends only to itself and prior tokens, even when bidirectional context would help, as in classification tasks, where encoder-only models of the same size can outperform it. Current research explores prefix attention (bidirectional attention over the prompt followed by causal decoding), mixture-of-experts decoder layers (as in Mixtral and, reportedly, GPT-4), and speculative decoding to accelerate the inherently sequential generation step.
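
Prefix attention differs from plain causal masking only in the attention mask, which a short sketch can show. The function name prefix_lm_mask and the sequence and prefix lengths are hypothetical choices for illustration, not a specific model's implementation.

```python
def prefix_lm_mask(n, prefix_len):
    """mask[i][j] is True when position i may attend to position j.

    Positions inside the prefix see the whole prefix (bidirectional);
    positions after it see the prefix plus earlier generated positions (causal).
    """
    return [[j < prefix_len or j <= i for j in range(n)] for i in range(n)]

mask = prefix_lm_mask(n=5, prefix_len=3)
for row in mask:
    print(" ".join("1" if ok else "0" for ok in row))
# 1 1 1 0 0
# 1 1 1 0 0
# 1 1 1 0 0
# 1 1 1 1 0
# 1 1 1 1 1
```

The first three rows are identical because every prompt position sees the full prompt. Setting prefix_len to 0 recovers the ordinary causal mask.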

Example

When a user sends a prompt to an LLM such as GPT-4 or Claude, the decoder-only model processes the full prompt and generates a response by sampling one token at a time, each new token attending causally to all previous tokens within the context window.
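
The token-by-token loop described above can be sketched with a toy stand-in for the model. The function toy_next_token_probs is hypothetical: it looks only at the last token, whereas a real decoder-only model would attend to the entire context. The vocabulary size, the end-of-sequence ID, and the seed are likewise assumptions for illustration.

```python
import random

def toy_next_token_probs(context, vocab_size):
    """Hypothetical stand-in for a forward pass: favors (last token + 1) mod vocab_size."""
    favored = (context[-1] + 1) % vocab_size
    probs = [0.1 / (vocab_size - 1)] * vocab_size
    probs[favored] = 0.9
    return probs

def generate(prompt, max_new_tokens, vocab_size=10, eos=0, seed=0):
    """Sample one token at a time, feeding each sampled token back into the context."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens, vocab_size)  # one "forward pass" per token
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)
        if next_token == eos:  # stop early at the end-of-sequence token
            break
    return tokens

print(generate([3, 4], max_new_tokens=5))
```

Each iteration depends on the token sampled in the previous one, so the steps cannot run in parallel. This sequential dependence is the cost that techniques such as speculative decoding try to reduce.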
