UCL and Stanford researcher explains why AI benchmarks no longer work
AI-processed from Habr AI; edited by Hamidun News
High scores on AI benchmarks no longer guarantee that a model will be useful in real-world work. Researcher Angela Aristidou from UCL and Stanford proposes restructuring the very logic of AI evaluation: looking not at results in a vacuum, but at how systems behave inside teams, processes, and long work cycles.
Why Tests Break
Today's benchmarks are convenient because they reduce everything to a simple question: did the model solve an isolated task better than a human? This approach works well for chess, exam questions, short code snippets, or texts with unambiguous answers. The industry gets clear rankings, accuracy percentages, and nice comparison tables.
The problem is that almost no one uses AI exactly as it's tested. In organizations, models don't work in a sterile environment, but in messy processes with multiple participants, internal rules, exceptions, and changing inputs. What matters is not only the speed and accuracy of the answer, but whether AI accelerates approvals, helps the team notice errors, and doesn't create a new layer of operational noise.
So a model that excels at synthetic tests can turn out to be a weak link in a real feedback loop.
The Problem of Real Teams
Aristidou gives the example of medical AI systems that formally show very strong results and even receive regulatory approval. In practice, doctors must fit the systems' output into local reporting standards, clinic requirements, and shared decision-making logic. Because of this, a tool that saves time on paper can actually introduce delays in the real process.
This is especially noticeable in environments where decisions are made not by a single specialist, but by a multidisciplinary team. Radiologists, oncologists, nurses, and other participants discuss the patient together, and the treatment plan is refined as new data arrives. In such a system, what matters is not only the accuracy of the suggestion, but how it affects collective discussion.
If a model triggers premature certainty, increases cognitive load, or breaks familiar coordination, a high test score doesn't mean much. This is how AI projects end up in what the author calls the "AI graveyard."
What HAIC Proposes
Instead of evaluating a single model on a one-off task, the author proposes the HAIC approach (Human-AI, Context-Specific Evaluation). The idea is to measure the "human + AI" pairing in a specific work environment and over the long term. This does not mean abandoning tests altogether; it means shifting focus from lab accuracy to real organizational impact. HAIC changes the logic of evaluation along several dimensions:
- instead of evaluating a single performer, the team and entire workflow are assessed
- instead of a single test, a long cycle of use is considered
- instead of accuracy and speed, coordination, final results, and error visibility are put at the center
- instead of an isolated answer, consequences for neighboring processes and decisions are analyzed
This approach is already being tried in practical cases. In one British hospital network, the question was not "does AI improve diagnostic accuracy," but "what changes in the work of a multidisciplinary team when AI is added to it." In the humanitarian sector, similar systems were tested for 18 months, with separate tracking of how easily people noticed and corrected model errors. Long observations of this kind show where guardrails are needed and where the technology truly helps.
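To make the idea of tracking error visibility over a long period more concrete, here is a minimal Python sketch. It is not Aristidou's method or any published HAIC tooling: the `ReviewEvent` schema, the `error_catch_rate_by_month` function, and the sample data are all hypothetical, chosen only to show one way such a metric could be computed from a log of reviewed AI suggestions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewEvent:
    """One AI suggestion reviewed by the team (hypothetical schema)."""
    month: int          # months since deployment, starting at 1
    ai_was_wrong: bool  # established afterwards, e.g. by an audit
    error_caught: bool  # did a human notice and correct the error?


def error_catch_rate_by_month(events):
    """Return {month: share of wrong AI suggestions that humans caught}.

    Months with no wrong suggestions are omitted, since the rate
    is undefined for them.
    """
    wrong = defaultdict(int)
    caught = defaultdict(int)
    for event in events:
        if event.ai_was_wrong:
            wrong[event.month] += 1
            if event.error_caught:
                caught[event.month] += 1
    return {month: caught[month] / wrong[month] for month in sorted(wrong)}


# Made-up illustrative data, not results from the article.
events = [
    ReviewEvent(month=1, ai_was_wrong=True, error_caught=True),
    ReviewEvent(month=1, ai_was_wrong=True, error_caught=False),
    ReviewEvent(month=1, ai_was_wrong=False, error_caught=False),
    ReviewEvent(month=2, ai_was_wrong=True, error_caught=True),
    ReviewEvent(month=2, ai_was_wrong=True, error_caught=True),
]
rates = error_catch_rate_by_month(events)
print(rates)  # {1: 0.5, 2: 1.0}
```

A benchmark score would be a single number taken before deployment. A series like this one changes month by month and reflects what the team actually does with the model's output, which is the kind of signal the long-cycle evaluations described above are looking for.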
What It Means
The market is gradually hitting the limit of synthetic metrics: they remain useful for basic model comparison, but they increasingly fail to predict the real value of deployment. If the HAIC approach becomes widespread, companies and regulators will have to evaluate AI in more complex ways and over longer periods, but with less risk of investing in a system that looks good on benchmarks and fails in a live process.