Models

Encoder–Decoder Architecture

An Encoder–Decoder architecture is a neural network design in which an encoder maps an input into a latent representation and a separate decoder generates the output sequence from that representation, making it well-suited for tasks like translation and summarization.

In the original sequence-to-sequence formulation (Sutskever et al., 2014), the encoder is an RNN that reads the full input and produces a fixed context vector; the decoder is a second RNN that generates the output token by token, conditioned on that vector. The attention mechanism (Bahdanau et al., 2015) generalized this by allowing the decoder to selectively attend to all encoder hidden states at each decoding step, solving the information bottleneck of a fixed-size context vector and substantially improving performance on long sequences.

The transformer (Vaswani et al., 2017) recast the encoder–decoder structure using self-attention and cross-attention. The encoder stack processes the full input sequence in parallel, producing contextualized representations; the decoder stack generates output tokens autoregressively, using causal self-attention over its own past outputs and cross-attention over the encoder's output. This design scales efficiently and underpins models including T5 (Google, 2019), BART (Meta, 2019), and mT5 for multilingual settings.

The architecture is particularly well-suited to tasks where input and output are structurally different: machine translation (source and target languages), abstractive summarization (article to summary), speech recognition (OpenAI's Whisper uses a CNN encoder for audio and a transformer decoder for text), and image captioning (visual encoder, text decoder). The encoder handles understanding; the decoder handles generation, and this separation of concerns is beneficial when the two processes require different attention patterns.

As of 2026, encoder–decoder models remain the standard for structured prediction tasks. T5-style models and their multilingual successors are deployed in Google Search, document translation, and summarization pipelines. However, for open-ended language generation at scale, decoder-only models have largely taken over because they are architecturally simpler to train and scale. Parameter-efficient fine-tuning techniques frequently couple a frozen encoder with a small decoder head for low-resource adaptation.

Example

Google Translate's neural backend uses an encoder–decoder transformer that encodes a Spanish sentence into contextualized representations and then decodes those representations token by token into English.

Related terms

← Glossary