IBM and Artificial Analysis create benchmark: AI agents fail at IT tasks

IBM and Artificial Analysis presented ITBench-AA — the first comprehensive benchmark for evaluating the ability of AI agents to solve real tasks in corporate IT environments. The results are disheartening: leading models scored less than 50%. This means that the dream of autonomous AI engineers remains just a dream.
What ITBench-AA tested
The benchmark included real-life IT operations scenarios: configuring network infrastructure, managing databases, debugging errors in production systems, deploying applications, and monitoring and optimizing resources. These are not simple written tasks with a single correct answer. Here, the AI must act as a full-fledged engineer: interact with system interfaces, analyze error logs, make decisions under uncertainty, and adjust its approach if the first attempt fails.
Leading models were tested: GPT-4, Claude 3 Opus, Gemini Ultra and others. The results were roughly the same: all performed at around 45–50%. Even more notably, when attempting to execute complex multi-step procedures, agents often got stuck or made critical errors.
What the real problem is
A score of around 50% is not just a low result; it signals fundamental limitations. IT work requires not only extensive knowledge but also qualities that AI currently shows only inconsistently:
- Flawlessness — one mistake can take down a system for thousands of users
- Sequential thinking — multi-step procedures require strict adherence to logic
- Contextual understanding — knowing not just what to do, but why each step is critical
- On-the-fly adaptation — when standard instructions don't fit due to environment specifics
- Accountability — the ability to step back and ask for human help when uncertain
In their current form, agents are better understood as assistive systems: they can help, but they require constant supervision and validation of their results.
Resetting expectations
ITBench-AA is already influencing company strategies. The illusion of "digital workers who will replace the IT department in a month" is fading. Instead, demand is growing for more realistic solutions: partnership between humans and AI, where the agent takes on routine work (config updates, basic monitoring, logging), and the engineer maintains control over critical operations.
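The division of labor described above can be sketched as a simple approval gate. This is a minimal illustration, not something taken from ITBench-AA or from any vendor's product: the names `Action`, `ROUTINE_KINDS`, `classify` and `dispatch` are hypothetical, and the routine categories simply mirror the examples in the text (config updates, basic monitoring, logging). Everything else is assumed to be critical and waits for an engineer.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    ROUTINE = "routine"
    CRITICAL = "critical"


# Hypothetical action categories; the routine ones mirror the article's examples.
ROUTINE_KINDS = {"config_update", "basic_monitoring", "logging"}


@dataclass(frozen=True)
class Action:
    kind: str          # category of the proposed operation
    description: str   # human-readable summary shown to the engineer


def classify(action: Action) -> Risk:
    """Treat anything not explicitly listed as routine as critical."""
    return Risk.ROUTINE if action.kind in ROUTINE_KINDS else Risk.CRITICAL


def dispatch(action: Action, approved_by_engineer: bool = False) -> str:
    """Decide what happens to an action proposed by the agent.

    Routine actions run without review; critical ones run only
    after an engineer has explicitly approved them.
    """
    if classify(action) is Risk.ROUTINE or approved_by_engineer:
        return "executed"
    return "needs_approval"
```

Defaulting unknown action kinds to critical is a deliberate choice in this sketch: an agent that proposes something outside the approved list is stopped rather than trusted.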
For the first time, the benchmark also provides a universally recognized standard for evaluating agents. ITBench-AA will give model developers a tool for understanding what to work on in their next versions.
What this means
AI is evolving, but more slowly than startups promise. The good news for IT specialists: your expertise remains a scarce resource. For companies, this is a signal that complete automation of IT tasks is not a project for a year or two. For model developers, it is a concrete roadmap for improvements.