
MIT Technology Review: Why Standard AI Tests No Longer Show Real-World Value


AI-processed from MIT Technology Review; edited by Hamidun News
Source: MIT Technology Review. Collage: Hamidun News.

MIT Technology Review writes that conventional AI benchmarks are increasingly failing to show how systems behave in real work. A model can win an isolated test and still slow down a team when embedded in a live process.

Why Tests Diverge

For decades, progress in AI has been measured as a "machine versus human" contest. The approach is convenient: give the model chess, math problems, coding tasks or essays, then compare its results with those of an individual human. Such tests are easy to standardize, turn into rankings and use in marketing. That is why a whole industry of impressive figures, leaderboards and comparisons that look great in presentations has grown up around benchmarks.

The problem is that AI is almost never used the way it's tested. In real work, the system doesn't exist in a vacuum: it's embedded in teams, processes, rules, deadlines and internal standards. Its value emerges not in one answer, but in a series of interactions over weeks and months. That's why a high score on an isolated task doesn't yet tell us whether a model will speed up work, reduce errors, or prove useful for an organization at all.

Where Evaluation Breaks Down

The article gives a telling example from medicine. Some image-analysis systems read scans faster and more accurately in tests than experienced radiologists do. On paper, this looks like a ready-made recipe for productivity growth.

But in a hospital, decisions are rarely made by one specialist at a single moment. Radiologists, oncologists, physicists, nurses and other team members may all be working on a single case, and the treatment plan changes as new data emerges. When such tools enter the real workflow, it turns out that staff need extra time to interpret the model's output, reconcile it with local reporting standards and verify compliance with regulatory requirements.

As a result, a system that promised to speed things up in testing sometimes creates delays in practice. It can also reinforce early "anchoring" on a plausible but incomplete answer, increase cognitive load and push errors further down the chain. This is how the "AI graveyard" arises: products with high ratings that never take root in real work.

What They Propose Instead

Instead of narrow tests, the author proposes HAIC benchmarks (Human-AI, Context-Specific Evaluation). The approach evaluates not just the model itself, but how it behaves within a specific team, process and organizational environment. The point is to bring evaluation closer to real use rather than to a lab demonstration.

  • Shift focus from an individual task to team work and the entire process
  • Measure effect not in a single test run, but over the long term
  • Treat coordination, the quality of joint decisions and the visibility of errors as important, alongside speed and accuracy
  • Look not just at the model's answer, but at what happens in the process before and after it is applied
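To make the contrast concrete, here is a minimal Python sketch of what logging process-level measures next to the isolated model score might look like. It is an illustration only, not the method described in the article: the `CaseRecord` fields, the `summarize` function and the choice of metrics are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaseRecord:
    """One case handled by a team with AI support (all fields are hypothetical)."""
    week: int                   # when the case was handled, for long-term tracking
    model_correct: bool         # was the model's own answer right?
    team_correct: bool          # was the team's final decision right?
    turnaround_minutes: float   # time from case arrival to the final decision
    model_error_caught: bool = False  # if the model was wrong, did someone notice?


def rate(flags):
    """Share of True values in a non-empty list of booleans."""
    return sum(flags) / len(flags)


def summarize(records):
    """Contrast the isolated model score with team- and process-level measures."""
    model_errors = [r for r in records if not r.model_correct]
    return {
        # What a conventional benchmark would report.
        "model_accuracy": rate([r.model_correct for r in records]),
        # What the organization actually experiences.
        "team_accuracy": rate([r.team_correct for r in records]),
        "mean_turnaround_minutes": (
            sum(r.turnaround_minutes for r in records) / len(records)
        ),
        # Visibility of errors: share of model mistakes the team noticed.
        "error_catch_rate": (
            rate([r.model_error_caught for r in model_errors])
            if model_errors else None
        ),
    }


# Made-up example data, only to show the shape of the output.
records = [
    CaseRecord(1, True, True, 30.0),
    CaseRecord(1, False, False, 50.0, model_error_caught=False),
    CaseRecord(2, False, True, 40.0, model_error_caught=True),
    CaseRecord(2, True, True, 40.0),
]
summary = summarize(records)
```

Because each record carries a `week`, the same function can be run on early and late slices of the log (for example `summarize([r for r in records if r.week <= 1])`) to see whether the effect holds over time rather than in a single test run.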

The author describes early examples of this approach. In one British hospital system, the question was framed not as "did diagnostics become more accurate" but as "does AI change the quality of collective discussion and interaction between specialists." In the humanities sector, a similar system was observed for 18 months, with separate tracking of how easily people notice and correct the model's errors. Such a long horizon makes it possible to design safeguards for a specific context, rather than hoping that a high test score by itself guarantees safety and utility.

What This Means

The industry is gradually hitting the limit of the old metrics: they show well what a model can do on its own, but poorly what happens when it becomes part of a live organization. For business and government, this is a signal to look not just at leaderboards, but at whether AI helps teams work more sustainably, quickly and safely in real conditions.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
