GPQA

GPQA (Graduate-Level Google-Proof Q&A) is a benchmark of 448 expert-written multiple-choice questions in biology, chemistry, and physics, designed so that skilled non-specialists rarely answer them correctly even with unrestricted web access, making it a test of genuine scientific reasoning.

GPQA (Graduate-Level Google-Proof Q&A) is a challenging evaluation benchmark introduced in a 2023 paper by David Rein and colleagues, including researchers affiliated with NYU and Anthropic. It contains 448 multiple-choice questions across graduate-level biology, chemistry, and physics, each written by a domain expert and then checked by other experts in the same field and by skilled non-experts. The validation process favors questions that experts other than the author answer correctly while non-experts mostly do not: even with unrestricted internet access, non-expert validators reached only about 34% accuracy overall. GPQA Diamond, a 198-question subset of the highest-quality items (those that expert validators answered correctly and most non-experts answered incorrectly), is the version most commonly reported in model evaluations.

The "Google-proof" design is the benchmark's defining feature. Questions require multi-step scientific reasoning that cannot be resolved by retrieving a matching passage online; a representative chemistry question might require applying quantum mechanical principles to predict spectroscopic properties from first principles. The original paper reports approximately 65% accuracy for domain experts (people who hold or are pursuing a PhD in the relevant field), while non-experts with web access score only modestly above the random-guessing baseline of 25%.
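
To make the 25% random baseline concrete, the sketch below computes the expected accuracy of uniform guessing and how much it would vary on a set the size of GPQA Diamond. It assumes four answer options per question, which is what a 25% baseline implies; the function name `chance_baseline` is illustrative only.

```python
import math


def chance_baseline(num_questions: int, num_options: int = 4) -> tuple[float, float]:
    """Expected accuracy and its standard deviation under uniform random guessing."""
    p = 1.0 / num_options
    # Standard deviation of the mean of num_questions independent Bernoulli(p) trials.
    std = math.sqrt(p * (1.0 - p) / num_questions)
    return p, std


# Assumption: 198 questions, matching the size of GPQA Diamond.
p, std = chance_baseline(198)
print(f"chance accuracy: {p:.1%} +/- {std:.1%}")
```

On 198 questions the spread around 25% is about three percentage points, so a score within a few points of 25% is hard to tell apart from guessing.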

GPQA became important after standard benchmarks like MMLU began to saturate among frontier models. It serves as a signal of whether a model has internalized genuine domain-expert reasoning rather than surface statistical patterns, and scores on GPQA Diamond correlate broadly with a model's capacity for multi-step scientific problem solving. The benchmark is widely cited in technical model release reports from major AI laboratories.

As of 2026, leading reasoning-focused models, including OpenAI's o1 and o3 series and Anthropic's Claude 3.7 and Claude 4 family, score well above the roughly 65% human expert baseline, with top models reaching roughly 80–90% on GPQA Diamond. This rapid progress is pushing the community toward harder follow-up evaluations, though GPQA remains a standard touchstone due to its clean design and strong resistance to shallow memorization.

Example

A biotech company vetting AI assistants for drug-discovery research uses GPQA Diamond to separate models with genuine biochemistry reasoning from those that retrieve surface-level answers, treating 70% accuracy as a minimum deployment threshold.
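
A threshold check like the one in this example could be sketched as follows. This is a minimal illustration, not a standard evaluation harness: the function names and the sample counts are hypothetical, and the Wilson lower bound is an added suggestion (not part of the benchmark) for accounting for sampling noise on a 198-question set.

```python
import math


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions where the predicted option matches the answer key."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval (z=1.96 gives roughly 95% confidence)."""
    p = correct / total
    denom = 1.0 + z * z / total
    centre = p + z * z / (2.0 * total)
    margin = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total))
    return (centre - margin) / denom


THRESHOLD = 0.70  # deployment threshold taken from the example above

# Hypothetical result: a model answers 150 of 198 questions correctly.
answers = ["A"] * 198
predictions = ["A"] * 150 + ["B"] * 48

acc = accuracy(predictions, answers)
lower = wilson_lower_bound(150, 198)
print(f"accuracy={acc:.3f} passes={acc >= THRESHOLD}")
print(f"wilson_lower={lower:.3f} passes_conservatively={lower >= THRESHOLD}")
```

With these made-up numbers the point estimate clears 70% but the conservative bound does not, which shows why a single score near the threshold is weak evidence on a benchmark of this size.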
