IBM and Artificial Analysis create benchmark: AI agents fail at IT tasks

Q: What is the source of this material?

The original publication is on the Hugging Face Blog. Hamidun News processes and adapts materials with the help of AI.

Q: When was it published?

2026-05-29. Reading time: 3 min.


Hamidun News Editorial

AI monitoring · Hugging Face Blog


Image source: Hugging Face Blog. Collage: Hamidun News.


IBM and Artificial Analysis have presented ITBench-AA, the first comprehensive benchmark for evaluating how well AI agents solve real tasks in corporate IT environments. The results are disheartening: leading models scored below 50%. For now, the autonomous AI engineer remains just a dream.

What ITBench-AA tested

The benchmark included real-world IT operations scenarios: configuring network infrastructure, managing databases, debugging errors in production systems, deploying applications, and monitoring and optimizing resources. These are not simple written questions with a single correct answer. The AI must act as a full-fledged engineer: interact with system interfaces, analyze error logs, make decisions under uncertainty, and adjust its approach when the first attempt fails.

The models tested included GPT-4, Claude 3 Opus, Gemini Ultra and others. Results were roughly the same across the board: all performed at around 45–50%. More notably, when attempting complex multi-step procedures, agents often got stuck or made critical errors.

What the real problem is

A score of around 50% is not just a low result; it signals fundamental limitations. IT work requires not only extensive knowledge but also qualities that AI currently shows only inconsistently:

Flawlessness — one mistake can take down a system for thousands of users
Sequential thinking — multi-step procedures require strict adherence to logic
Contextual understanding — knowing not just what to do, but why each step is critical
On-the-fly adaptation — when standard instructions don't fit due to environment specifics
Accountability — the ability to step back and ask for human help when uncertain

In their current form, agents are assistive systems: they can help, but they require constant supervision and validation of their results.

Resetting expectations

ITBench-AA is already influencing company strategies. The illusion of "digital workers who will replace the IT department in a month" is fading. Instead, demand is growing for more realistic solutions: partnership between humans and AI, where the agent takes on routine work (config updates, basic monitoring, logging), and the engineer maintains control over critical operations.

For the first time, the benchmark also provides a universally recognized standard for evaluating agents. ITBench-AA will give model developers a tool for deciding what to improve in upcoming versions.
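
The division of labor described in this section, where the agent handles routine work and the engineer keeps control of critical operations, can be sketched as a simple approval gate. This is a minimal illustration and not part of ITBench-AA: the task categories, the `ProposedAction` type, the `route` function and the confidence threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical task categories. The routine ones mirror the article's examples
# (config updates, basic monitoring, logging); anything else is treated as critical.
ROUTINE_TASKS = {"config_update", "basic_monitoring", "logging"}


@dataclass
class ProposedAction:
    task_type: str     # e.g. "config_update" or "database_migration"
    description: str   # human-readable summary of what the agent wants to do
    confidence: float  # agent's self-reported confidence in [0, 1] (an assumption)


def route(action: ProposedAction, min_confidence: float = 0.9) -> str:
    """Return "auto" if the agent may proceed alone, otherwise "human_review"."""
    # Critical or unknown task types always go to an engineer.
    if action.task_type not in ROUTINE_TASKS:
        return "human_review"
    # Even routine work is escalated when the agent is unsure.
    if action.confidence < min_confidence:
        return "human_review"
    return "auto"


if __name__ == "__main__":
    print(route(ProposedAction("logging", "rotate application logs", 0.97)))
    print(route(ProposedAction("database_migration", "alter orders table", 0.99)))
    print(route(ProposedAction("config_update", "change proxy timeout", 0.50)))
```

With these assumed inputs, only the first action is cleared to run automatically; the other two are sent for human review, one because the task is not on the routine list and one because the agent's confidence is below the threshold. Defaulting unknown task types to review is a deliberate choice in this sketch: it fails safe when the agent proposes something the policy did not anticipate.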

What this means

AI is evolving, but more slowly than startups promise. The good news for IT specialists: your expertise remains a scarce resource. For companies, the signal is that complete automation of IT tasks is not a one- or two-year project. For model developers, the benchmark is a concrete roadmap for improvements.

Hamidun News

AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
