DeepMind proposed ten cognitive scales for measuring progress toward AGI

Google DeepMind published "Measuring Progress Toward AGI", a follow-up to its 2023 AGI levels classification. Instead of a single rating, it offers ten independent scales based on tools from cognitive psychology rather than datasets. The goal is to let the industry compare AI systems on common, verifiable scales rather than simply taking labs' self-assessments at face value.

Khamidun Zhemal

Apr 30, 2026

AI-processed from Habr AI; edited by Hamidun News

Google DeepMind has published a paper titled "Measuring Progress Toward AGI" — an attempt to provide the industry with a tool for genuinely measuring progress toward AGI, rather than yet another classification system with no way to verify it.

Where the problem came from

In 2023, DeepMind published "Levels of AGI", a system of five levels of intelligence (from emerging to superhuman) and six levels of autonomy (from simple tool to fully autonomous agent). The analogy with autonomous driving levels turned out to be apt: the scheme was structured, easy to visualize, and convenient for explaining to investors and journalists. The industry gained a common vocabulary for talking about AGI.

But the classification had a fundamental flaw: there was no tool to verify where any given system actually stood. Each company could call its model "level 2" or "level 3," and no one had a way to dispute it. "AGI" became a marketing label, convenient for press releases and for attracting investment but of little use to science.

This new work attempts to solve this very problem.

Ten scales instead of one score

The paper, released in March 2026, proposes a fundamentally different approach. Instead of a single overall rating, it defines ten separate scales, each measuring a specific aspect of cognitive ability. The scales are independent: a system can score high on reasoning but low on adaptation to new tasks, and this mismatch stays clearly visible instead of being hidden behind an average. The result is a multidimensional portrait of a system, not a single number.

The key difference from conventional benchmarking is that the scales are built not on datasets and problem sets but on tools from cognitive psychology, a field that has studied human intelligence for decades and developed methodologies resistant to the effects of practice and training.

Among the measured aspects:

Working memory and context retention
Planning and multi-step reasoning
Transfer of knowledge to new domains
Learning from a small number of examples (few-shot)
Meta-cognition — understanding the boundaries of one's own knowledge
Causal reasoning
Adaptation to data outside the training distribution

The authors position the framework as a starting point for discussion, not a final standard. The list of scales is open for expansion.
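
To make the point about independent scales concrete, here is a minimal Python sketch. It assumes each scale yields a score between 0 and 1, which the article does not specify. The scale names follow the aspects listed above, and every number is invented for illustration. This is not DeepMind's scoring method, only a demonstration of how an average can hide a mismatch that a per-scale profile exposes.

```python
from statistics import mean

# Hypothetical scores (0 to 1) for two imaginary systems.
# Names mirror the aspects listed in the article; values are made up.
SYSTEM_A = {
    "working_memory": 0.90,
    "planning": 0.85,
    "knowledge_transfer": 0.30,
    "few_shot_learning": 0.80,
    "metacognition": 0.75,
    "causal_reasoning": 0.85,
    "ood_adaptation": 0.25,
}
SYSTEM_B = {
    "working_memory": 0.70,
    "planning": 0.65,
    "knowledge_transfer": 0.70,
    "few_shot_learning": 0.65,
    "metacognition": 0.70,
    "causal_reasoning": 0.65,
    "ood_adaptation": 0.65,
}


def overall(profile):
    """Collapse a profile into the single averaged number the article warns about."""
    return mean(list(profile.values()))


def weak_scales(profile, threshold=0.5):
    """Return the names of scales scoring below the threshold, weakest first."""
    below = [name for name, score in profile.items() if score < threshold]
    return sorted(below, key=profile.get)


if __name__ == "__main__":
    for label, profile in (("A", SYSTEM_A), ("B", SYSTEM_B)):
        print(label, round(overall(profile), 3), weak_scales(profile))
```

Both imaginary systems have practically the same average, yet only the per-scale view shows that system A falls short on knowledge transfer and on adaptation to out-of-distribution data.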

Why this matters more than benchmarks

Until now, progress in AI has been measured indirectly, through benchmarks such as MMLU, HumanEval, ARC-Challenge, and GSM8K. The problem is that models are routinely tuned to score well on specific benchmarks, in effect overfitting to them. A high score on MMLU long ago ceased to be a reliable indicator of actual reasoning ability. Everyone in the industry knows this, yet the standards have not changed. The cognitive-psychology approach is significantly harder to fool: if a model cannot generalize to fundamentally new tasks, no amount of additional training on the test set will hide it. Methodologies developed to measure human intelligence are designed to resist this kind of gaming.
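
The overfitting argument can be illustrated with a small sketch: compare accuracy on items a model may have encountered during training with accuracy on freshly written items of the same kind. The function names and the pass/fail data below are hypothetical, and this is only one simple way to expose a generalization gap, not the procedure from the paper.

```python
def accuracy(outcomes):
    """Fraction of passed items in a list of True/False outcomes."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def generalization_gap(public_outcomes, novel_outcomes):
    """Accuracy on public benchmark items minus accuracy on novel items.

    A large positive gap suggests the public score overstates real ability.
    """
    return accuracy(public_outcomes) - accuracy(novel_outcomes)


# Invented results: 18 of 20 public items passed, 11 of 20 novel items passed.
public_outcomes = [True] * 18 + [False] * 2
novel_outcomes = [True] * 11 + [False] * 9

gap = generalization_gap(public_outcomes, novel_outcomes)
print(f"public={accuracy(public_outcomes):.2f} "
      f"novel={accuracy(novel_outcomes):.2f} gap={gap:.2f}")
```

In this made-up case the public score looks strong while the score on unseen items is far lower, which is the kind of discrepancy a single benchmark number cannot reveal.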

For investors, corporate AI buyers, and regulators, this potentially means the end of the era when any laboratory could announce an "AGI breakthrough" without the possibility of independent verification. Common measurable scales make systems from different companies comparable, and comparability brings accountability.

What this means

DeepMind is shifting the conversation about AGI from "we have level N" to "here's specifically how this can be measured." This is not an answer about AGI timelines and not a guarantee of consensus — different laboratories will interpret the scales differently. But it is the first serious step toward common evaluation standards, built on science rather than marketing.
