Humanity's Last Exam: Why the Top AI Benchmark from CAIS Is Called a Distraction
Humanity's Last Exam — 3,000 PhD-level questions from Center for AI Safety and Scale AI — became the most complex AI benchmark of 2025. Top models score less…
AI-processed from KDnuggets; edited by Hamidun News
The Humanity's Last Exam (HLE) benchmark has been one of the most discussed AI evaluation tools since its publication in January 2025, and also one of the most criticized. KDnuggets analysts gathered a range of expert opinions and concluded that the test is more likely to distract the community from what matters than to serve as a useful benchmark.
What is Humanity's Last Exam
HLE was created jointly by the nonprofit Center for AI Safety (CAIS) and Scale AI. The benchmark contains 3,000 PhD-level questions across more than 100 academic disciplines: mathematics, molecular biology, classical languages, history of science, and dozens of other fields. Questions were compiled and verified by hundreds of professors and graduate students worldwide.
Key parameters:
- Release date: January 2025
- Authors: Center for AI Safety and Scale AI
- Size: 3,000 questions across 100+ disciplines
- Best result at launch: approximately 18% for OpenAI o3
- Other models: approximately 3% for GPT-4o, approximately 8% for Claude 3.5 Sonnet
- Question sourcing: compiled and verified by hundreds of scientists
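To put the percentages above in absolute terms, here is a minimal Python sketch that converts the approximate launch scores into rough counts of correct answers. It assumes all 3,000 questions are weighted equally. The dictionary and function names are illustrative only and are not part of any HLE tooling.

```python
# Approximate figures quoted in this article, not official leaderboard data.
TOTAL_QUESTIONS = 3000

approx_accuracy = {
    "OpenAI o3": 0.18,
    "Claude 3.5 Sonnet": 0.08,
    "GPT-4o": 0.03,
}


def approx_correct(accuracy: float, total: int = TOTAL_QUESTIONS) -> int:
    """Return the rough number of correctly answered questions."""
    return round(accuracy * total)


for model, accuracy in approx_accuracy.items():
    print(f"{model}: ~{approx_correct(accuracy)} of {TOTAL_QUESTIONS} correct")
```

Under these assumptions, even the best quoted result corresponds to only a few hundred correct answers out of 3,000.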
The authors' goal was clear: to demonstrate that current models still fall far short of human experts on the most demanding cognitive tasks. In 2024–2025, public AI demonstrations often created the illusion of imminent AGI, and HLE served as a counterargument: "look how far we still have to go."
Why HLE is called a distraction
The main criticism is irrelevance. The test checks knowledge of rare academic facts: little-known theorems from two centuries ago, exact quotes from Sanskrit texts, specific biochemical reactions. A low score on such a test does not mean that a model is poor at writing code, analyzing data, synthesizing research, or assisting with medical diagnosis.
The second argument is the well-known Goodhart's Law: once a metric becomes a target, it ceases to be a reliable measure. If leading AI labs begin, explicitly or implicitly, to optimize models for HLE, scores will rise without real growth in product usefulness. This is exactly what happened with MMLU and a number of other benchmarks before it.
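The Goodhart's Law argument can be illustrated with a toy model. The sketch below is purely hypothetical: it assumes a lab splits a fixed training budget between general capability and benchmark-specific tuning, and that the benchmark rewards both while real-world usefulness rewards only the former. The weights and function names are invented for illustration and do not describe any real model or lab.

```python
BUDGET = 10.0  # arbitrary units of training effort (assumption)


def benchmark_score(general: float, tuned: float) -> float:
    """Hypothetical benchmark: rewards general skill, but rewards
    benchmark-specific tuning three times as much."""
    return 1.0 * general + 3.0 * tuned


def real_usefulness(general: float, tuned: float) -> float:
    """Hypothetical usefulness: only general skill helps users."""
    return 1.0 * general


def allocate(tuned_share: float) -> tuple[float, float]:
    """Split the budget; tuned_share goes to benchmark-specific tuning."""
    tuned = BUDGET * tuned_share
    return BUDGET - tuned, tuned


balanced = allocate(0.0)  # ignore the benchmark entirely
chasing = allocate(0.8)   # spend most of the budget chasing the metric

for name, plan in [("balanced", balanced), ("chasing", chasing)]:
    print(f"{name}: benchmark={benchmark_score(*plan):.1f}, "
          f"usefulness={real_usefulness(*plan):.1f}")
```

In this toy setup the metric-chasing strategy scores higher on the benchmark while being less useful, which is the divergence critics warn about. Real training dynamics are far more complex, so the sketch shows only the shape of the argument.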
"We need tests that measure how much AI helps me work better, not how well it knows academic obscurities."
The third layer of criticism concerns transparency: HLE questions are not fully public (a portion is held back as a private test set), which makes independent reproduction of results and external audit difficult.
What HLE supporters say
Defenders of the benchmark appeal to its original intent: HLE did not claim to measure product utility. Its task is to measure the ceiling of current systems in cognitively complex areas where human expertise has not yet been reproduced. From this perspective, the test succeeded: it tempered some of the hype and provided journalists, investors, and regulators with a clear argument against premature declarations of AGI.
Moreover, the creators point out that extremely difficult tests create a "safety margin": when models begin to score 50–70% on HLE, it will be a genuine warning signal rather than marketing noise.
What this means
Humanity's Last Exam fulfilled its first task: it showed the limits of current AI systems on academically complex tasks. But as a long-term measure of progress, it raises justified doubts, because optimizing for academic obscurities does not lead to real utility. Useful assessment of AI progress requires benchmarks that test real scenarios such as code writing, data analysis, medical assistance, and legal analysis. As long as benchmark selection remains academic, the discussion about "true AI capability" risks taking place in a vacuum.
Frequently Asked Questions
What result did OpenAI o3 achieve on Humanity's Last Exam?
According to the January 2025 release, OpenAI o3 answered approximately 18% of questions correctly, the best result among tested models at the time of publication. Most other top systems, including GPT-4o and Claude 3.5 Sonnet, remained in the 3–8% range.
Who created the HLE benchmark and why?
The benchmark was developed jointly by the Center for AI Safety (CAIS) and Scale AI. The authors aimed to show that modern AI systems have not yet reached the level of the best human specialists in complex cognitive tasks, and to temper inflated expectations around AGI.