
MMLU

MMLU (Massive Multitask Language Understanding) is a benchmark evaluating language models on approximately 14,000 multiple-choice questions across 57 academic subjects — from calculus and medicine to law and ethics — providing a standardized measure of broad knowledge and reasoning.

MMLU (Massive Multitask Language Understanding) is a language model evaluation benchmark introduced by Dan Hendrycks and colleagues at UC Berkeley in a 2020 paper. Its test set comprises 14,042 four-choice questions drawn from 57 subject areas spanning STEM, humanities, law, social sciences, and professional domains including medicine and accounting. Questions are sourced from real academic exams, standardized tests, and textbooks, and target knowledge that a broadly educated expert is expected to possess.

Performance is reported as the percentage of correct answers, typically in a 5-shot setting where the model receives five example question-answer pairs before each test question. Random guessing yields 25%; non-expert human performance is roughly 34%; estimated domain-expert human performance is approximately 89–90%. GPT-3 (175B) scored about 44% in few-shot evaluation when the benchmark was published in 2020. Rapid capability growth followed: GPT-4 exceeded 86% in 2023, and multiple frontier models including Gemini Ultra, Claude 3 Opus, and Llama 3.1 405B subsequently scored in the 85–90% range.
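
The 5-shot setup and the accuracy metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation code: the Question type, the helper names, and the exact prompt layout are assumptions made for this example, and real evaluation harnesses differ in details such as subject headers and how the model's answer is extracted.

```python
from __future__ import annotations

from dataclasses import dataclass

CHOICE_LABELS = ("A", "B", "C", "D")


@dataclass(frozen=True)
class Question:
    """One four-choice item (hypothetical type); `answer` indexes `choices`."""
    text: str
    choices: tuple[str, str, str, str]
    answer: int


def format_question(q: Question, include_answer: bool) -> str:
    """Render a question, its lettered options, and an 'Answer:' line."""
    lines = [q.text]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, q.choices)]
    # Solved examples show the answer letter; the test question leaves it blank.
    suffix = f" {CHOICE_LABELS[q.answer]}" if include_answer else ""
    lines.append("Answer:" + suffix)
    return "\n".join(lines)


def build_few_shot_prompt(examples: list[Question], test_q: Question) -> str:
    """Place the solved examples (five for 5-shot) before the unanswered test question."""
    blocks = [format_question(ex, include_answer=True) for ex in examples]
    blocks.append(format_question(test_q, include_answer=False))
    return "\n\n".join(blocks)


def accuracy_percent(predictions: list[str], questions: list[Question]) -> float:
    """Percentage of predicted letters that match the answer key."""
    if not questions or len(predictions) != len(questions):
        raise ValueError("need one prediction per question, and at least one question")
    correct = sum(
        pred.strip().upper() == CHOICE_LABELS[q.answer]
        for pred, q in zip(predictions, questions)
    )
    return 100.0 * correct / len(questions)
```

With four options per question, a model that guesses uniformly at random is expected to score 100 / 4 = 25% under accuracy_percent, which is the random baseline quoted above.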

MMLU's breadth made it the dominant benchmark for cross-organization model comparison for several years. It revealed clear capability jumps tied to model scale and training improvements and was routinely cited in model release reports as a primary general-capability signal. However, it has faced significant criticism: evidence of training-data contamination (test questions appearing in pretraining corpora), ceiling effects among top models, and concerns that multiple-choice performance may reflect surface-level pattern matching rather than genuine reasoning.

As of 2026, MMLU retains its role as a historical reference and comparative baseline but is increasingly supplemented by harder variants such as MMLU-Pro, which expands each question to ten answer options and places more weight on multi-step reasoning, and by benchmarks such as GPQA and ARC-AGI for discriminating among frontier models. Its chief remaining value is providing a common scale against which older and newer models can be positioned.

Example

A research team comparing three open-weight models against GPT-4 reports each model's 5-shot MMLU accuracy alongside task-specific scores. A model scoring 88% is described as falling within the GPT-4-class range on broad academic knowledge.
