Mixture of Experts (MoE)
Mixture of Experts (MoE) is a neural network architecture where a learned routing mechanism activates only a small subset of specialized sub-networks (experts) for each input token, allowing large total parameter counts without proportional compute cost per token.
Mixture of Experts (MoE) is a neural network design in which the model contains a large set of parallel sub-networks called experts, and a lightweight gating or routing mechanism selects only a small number of them (commonly one or two) to process each input token. Because only a fraction of the model's total parameters is active for any given token, an MoE model can hold far more parameters, and therefore more learned capacity, than a dense model with the same per-token compute. The concept originates in work by Jacobs et al. in 1991. It was scaled up for neural language models in Google's Sparsely-Gated MoE paper (2017), which inserted MoE layers into recurrent networks, and was adapted to large-scale transformers in work such as the Switch Transformer (2021).
In a standard transformer MoE layer, the feed-forward network block is replaced by a set of N expert feed-forward networks. A router network, typically a small linear layer, examines each token's representation and outputs scores over all experts; the top-k experts (typically k=1 or k=2) are selected, only those experts are evaluated, and a weighted sum of their outputs is returned, with weights derived from the router scores. Auxiliary loss terms penalize uneven routing to prevent tokens from collapsing onto a few popular experts, a pathology called expert collapse. During training, the gradients for a given token flow only through the router and the experts selected for that token, so per-token expert compute in both the forward and backward pass scales with k rather than N. The main engineering challenges are load balancing across experts and, in distributed training, the all-to-all communication required when experts reside on different accelerators.
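The routing and load-balancing steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the toy experts, the layer sizes, and all function names are hypothetical. Renormalizing the gate weights over the selected experts is one common choice among several, and the auxiliary loss follows the general form used by the Switch Transformer (N times the sum over experts of dispatch fraction times mean router probability), extended here to top-k by counting every assignment.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def make_expert(rng, d_model, d_hidden):
    """Toy expert: a two-layer feed-forward network with ReLU (illustrative sizes)."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    def expert(x):
        hidden = [max(0.0, h) for h in matvec(w_in, x)]
        return matvec(w_out, hidden)
    return expert

def route(token, router_w, k):
    """Score all experts with a linear router and keep the top k."""
    probs = softmax(matvec(router_w, token))
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    gates = [probs[i] / norm for i in top]  # assumption: renormalize over the top k
    return top, gates, probs

def moe_forward(token, router_w, experts, k=2):
    """Evaluate only the k selected experts and return their weighted sum."""
    top, gates, probs = route(token, router_w, k)
    out = [0.0] * len(token)
    for idx, gate in zip(top, gates):
        y = experts[idx](token)  # the other N - k experts are never evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top, probs

def load_balance_loss(all_top, all_probs, n_experts):
    """Auxiliary loss: N * sum_i (dispatch fraction_i * mean router probability_i)."""
    n_tokens = len(all_top)
    n_assign = sum(len(top) for top in all_top)
    frac = [0.0] * n_experts
    for top in all_top:
        for idx in top:
            frac[idx] += 1.0 / n_assign
    mean_prob = [sum(p[i] for p in all_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))

rng = random.Random(0)
D_MODEL, D_HIDDEN, N_EXPERTS, K = 8, 16, 4, 2
router_w = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_EXPERTS)]
experts = [make_expert(rng, D_MODEL, D_HIDDEN) for _ in range(N_EXPERTS)]
tokens = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(32)]

results = [moe_forward(t, router_w, experts, K) for t in tokens]
aux = load_balance_loss([r[1] for r in results], [r[2] for r in results], N_EXPERTS)
print(f"auxiliary load-balancing loss: {aux:.3f}")
```

With this formulation the loss equals 1 when tokens and router probability are spread evenly over the experts, and it approaches N when all tokens are sent to a single expert, which is why minimizing it discourages collapse. In a real training setup the loss would be scaled by a small coefficient and added to the main objective.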
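MoE matters because it decouples model capacity from per-token compute cost. A dense model must engage all its parameters for every token; an MoE model sends each token to only a few experts, gaining capacity without a matching increase in compute. A useful illustration: Mistral's Mixtral 8x7B has approximately 47 billion total parameters but activates roughly 13 billion per token. The total is below 8 x 7 = 56 billion because only the feed-forward blocks are replicated per expert; attention and embedding parameters are shared. Mistral reported that the model matches or outperforms Llama 2 70B, a dense model with about five times as many active parameters, on most of the benchmarks it evaluated. This trade-off is particularly attractive at serving scale, where per-token compute and latency are major cost drivers, with the caveat that all experts must still be held in accelerator memory, so MoE reduces compute per token rather than memory footprint.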
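The gap between total and active parameters can be checked with a short back-of-the-envelope calculation. The sketch below assumes configuration values recalled from the published Mixtral 8x7B description (model width 4096, 32 layers, expert hidden size 14336, 8 experts with top-2 routing, 32 query heads and 8 key/value heads of dimension 128, vocabulary of 32000); they are not stated in the text above and should be treated as assumptions. The function name is hypothetical, and small terms such as normalization weights are ignored.

```python
def moe_param_counts(d_model, n_layers, d_ff, n_experts, top_k,
                     n_heads, n_kv_heads, head_dim, vocab_size):
    """Approximate total and per-token active parameter counts for an MoE transformer."""
    # One expert: gate, up, and down projections of a gated feed-forward network.
    expert = 3 * d_model * d_ff
    # Attention: query and output projections plus smaller key/value projections.
    attn = 2 * d_model * n_heads * head_dim + 2 * d_model * n_kv_heads * head_dim
    # Router: one linear layer scoring every expert.
    router = d_model * n_experts
    # Input embedding and output projection, assumed untied.
    embed = 2 * vocab_size * d_model

    shared = n_layers * (attn + router) + embed           # used by every token
    total = shared + n_layers * n_experts * expert        # all experts stored
    active = shared + n_layers * top_k * expert           # only top_k experts run
    return total, active

# Assumed Mixtral 8x7B-like configuration (see the note above).
total, active = moe_param_counts(
    d_model=4096, n_layers=32, d_ff=14336, n_experts=8, top_k=2,
    n_heads=32, n_kv_heads=8, head_dim=128, vocab_size=32000,
)
print(f"total:  {total / 1e9:.1f}B parameters")
print(f"active: {active / 1e9:.1f}B parameters per token")
```

Under these assumptions the estimate lands at about 46.7 billion total and 12.9 billion active parameters, consistent with the approximate figures quoted above. The expert feed-forward weights dominate the total, which is why adding experts grows capacity quickly while leaving the per-token cost almost unchanged.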
By 2026, MoE has become a mainstream production architecture across model families. Mistral AI's Mixtral 8x7B and 8x22B (released in December 2023 and April 2024) popularized open-weight MoE. Google's technical report describes Gemini 1.5 Pro as a sparse MoE transformer, and press reporting indicates that GPT-4 also employs a mixture-of-experts design, though OpenAI has not confirmed this. Meta's Llama 4 Scout and Maverick models (released in April 2025) are MoE architectures with 17 billion active parameters out of roughly 109 billion and 400 billion total parameters, respectively. Related techniques such as mixture-of-depths, in which a router lets some tokens skip the computation in individual transformer blocks, extend the principle of conditional computation beyond the feed-forward block.