METR Explains How AI Approaches Autonomous Execution of Complex Tasks for Nearly 12 Hours

Q: What is the source?

Originally published on Bloomberg Tech. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Apr 27, 2026. Reading time: 3 min.


Hamidun News Editorial


In a Bloomberg Tech video based on the Odd Lots podcast episode from April 25, 2026, representatives of the research organization METR explained why the main question about AI is no longer "can a model answer a query" but "how long a complex, multi-step task can it carry through on its own." By their assessment, Claude Opus 4.6 is approaching the level at which an agent can, with notable probability, complete work that would take a human almost 12 hours.

METR, or Model Evaluation and Threat Research, measures how far leading models have advanced in autonomous operation. Organization president Chris Painter and researcher Joel Becker discussed not conventional knowledge benchmarks but tasks in which a model must plan, use tools, write and verify code, fix errors, and finish the work without constant human prompting. This mode of operation is what matters for judging both the practical usefulness of agent systems and the risks they carry.

METR's key metric is the time horizon. It is not the time the AI spends on a task but the task's difficulty, measured by how long a qualified human would need to complete it. On METR's official leaderboard, this estimate is built from more than a hundred tasks drawn from software development, machine learning, and cybersecurity.

For each model, researchers perform multiple independent runs, compare the results against human baselines, and then fit a success-probability curve. The process takes not hours but at least one to two weeks of calendar time, because the team must set up working infrastructure, check for failures, rule out attempts to game the evaluation, and manually recheck disputed runs. A 50-percent horizon of several hours means the model succeeds roughly half the time on tasks that take a human that long.
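To make the idea of a success-probability curve concrete, here is a minimal sketch in Python. It assumes a logistic curve over the base-2 logarithm of human task time, fitted by plain gradient ascent; the function names and the run counts are invented for illustration and are not METR's code or data.

```python
import math

def fit_success_curve(results, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(a + b * log2(human_minutes)) by gradient ascent.

    results: list of (human_minutes, successes, runs) tuples.
    Returns the fitted intercept a and slope b.
    """
    total_runs = sum(runs for _, _, runs in results)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for minutes, successes, runs in results:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            residual = successes - runs * p  # observed minus expected successes
            grad_a += residual
            grad_b += residual * x
        a += lr * grad_a / total_runs
        b += lr * grad_b / total_runs
    return a, b

def horizon_minutes(a, b, p=0.5):
    """Human task length (minutes) at which the fitted success rate equals p."""
    logit = math.log(p / (1.0 - p))
    return 2 ** ((logit - a) / b)

# Hypothetical data: (human minutes, successful runs, total runs) per task length.
RESULTS = [(1, 10, 10), (4, 9, 10), (15, 8, 10),
           (60, 5, 10), (240, 2, 10), (960, 0, 10)]

a, b = fit_success_curve(RESULTS)
h50 = horizon_minutes(a, b, 0.5)  # 50-percent time horizon
h80 = horizon_minutes(a, b, 0.8)  # stricter 80-percent horizon
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

In this toy data the model succeeds on half of the 60-minute tasks, so the fitted 50-percent horizon lands in the neighborhood of an hour, and the 80-percent horizon is shorter because it demands more reliability.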

Because the horizon measures sustained autonomous work rather than a single response, the figure of nearly 12 hours for Claude Opus 4.6 carries more weight than yet another benchmark record. It is not about a polished chat answer but about the ability to maintain context, break work into stages, and keep going after the first failure.

In its January update, Time Horizon 1.1, METR also noted that the capability horizon of leading models has historically doubled roughly every seven months, and that for models from after 2023 the pace appeared even faster. At the same time, METR itself cautions that such figures cannot be translated directly into readiness to replace humans in any kind of intellectual work.
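The doubling claim is ordinary exponential arithmetic, which a short Python sketch can spell out. The function names are hypothetical, the 12-hour starting point is borrowed from the article's headline figure purely for illustration, and the result is a worked calculation under a constant seven-month doubling assumption, not a forecast.

```python
import math

def projected_horizon(start_hours, months_elapsed, doubling_months=7.0):
    """Horizon after months_elapsed, assuming it doubles every doubling_months."""
    return start_hours * 2 ** (months_elapsed / doubling_months)

def months_to_reach(target_hours, start_hours, doubling_months=7.0):
    """Months needed to grow from start_hours to target_hours at that pace."""
    return doubling_months * math.log2(target_hours / start_hours)

# Two doublings (14 months) take a 12-hour horizon to 48 hours.
print(projected_horizon(12, 14))
print(months_to_reach(48, 12))
```

A shorter doubling period, as suggested for more recent models, can be explored by passing a smaller `doubling_months` value.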

METR's task set consists mostly of well-specified engineering and research cases whose results can be clearly verified. Ordinary work involves far more hidden context, communication, and ambiguous success criteria.

The discussion points to another conclusion as well. When people say that AI is starting to "work together," in practice this increasingly means a combination of a model, tools, and a control loop, rather than simply a second chatbot in the next window. Modern agent systems can already invoke code editors, run tests, search for information, and pass intermediate results on to the next step. The longer the base model's autonomous horizon, the more useful such chains become, and the harder it is for a human to keep full control over every action.
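The combination of a model, tools, and a control loop can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: `scripted_policy` plays the role of the model, the two lambdas play the role of tools, and `max_steps` is one simple way a human-set budget can bound the agent's autonomy.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str, str]               # (tool name, argument, tool result)
Policy = Callable[[str, List[Step]], Tuple[str, str]]

def run_agent(policy: Policy, tools: Dict[str, Callable[[str], str]],
              task: str, max_steps: int = 10) -> Tuple[Optional[str], List[Step]]:
    """Control loop: ask the policy for an action, run the tool, feed the result back."""
    history: List[Step] = []
    for _ in range(max_steps):
        action, arg = policy(task, history)
        if action == "finish":
            return arg, history
        if action in tools:
            result = tools[action](arg)
        else:
            result = f"unknown tool: {action}"
        history.append((action, arg, result))
    return None, history                   # step budget exhausted without finishing

def scripted_policy(task: str, history: List[Step]) -> Tuple[str, str]:
    """Stand-in for a model: search first, then test, then finish."""
    if not history:
        return "search", task
    if len(history) == 1:
        return "run_tests", history[-1][2]
    return "finish", history[-1][2]

TOOLS = {
    "search": lambda query: f"notes on {query}",
    "run_tests": lambda code: f"tests passed for: {code}",
}

answer, trace = run_agent(scripted_policy, TOOLS, "parse logs")
print(answer)
print(len(trace), "tool calls")
```

Each tool result is appended to the history and handed back to the policy, which is how intermediate results reach the next step. The step log is also what a human reviewer would inspect after the fact.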

Because longer horizons make full human oversight harder, METR treats horizon growth not only as product progress but also as a signal for risk assessment, including scenarios in which systems gain too much autonomy.

The practical upshot is that the AI market is gradually shifting from comparing answers to comparing autonomy. For companies, the question is which processes can already be delegated to agents. For model developers, it is how quickly systems' real ability to finish long tasks is growing. And for regulators and safety researchers, it is an early indicator of the moment when the conversation about autonomous AI stops being theoretical and becomes operational reality.

