IBM Research analyzed where AI agents break down on APIs, documents, and rules in VAKRA

Q: What is the source?

Originally published on Hugging Face Blog. Hamidun News processes and adapts the material with AI.

Q: When was it published?

May 2, 2026. Reading time: 3 min.


Hamidun News Editorial


IBM Research has published a detailed analysis of why even strong language models still fail at agentic tasks. Its analysis of the VAKRA benchmark shows that making one clean API call is not enough: problems begin when the agent has to work through multiple steps, select the right data source, and stay within tool usage rules.

How VAKRA is organized

VAKRA is an executable benchmark for enterprise agents. Instead of toy function calls, it gives models a working environment with over 8,000 locally deployed APIs, real databases across 62 domains, and document collections for specific subject areas. A typical scenario requires not a single answer, but a chain of 3–7 steps: get data, select the right tool, extract a fact from a document, pass the result to the next call, and only then assemble the final answer.

The key idea is that VAKRA evaluates the model's entire trajectory of actions, not only its final response. For complex tasks, the system first checks whether the agent adhered to the textual constraints on tool usage, then replays its calls in the same environment, compares the intermediate results with the reference, and only then evaluates the final answer. This matters because an agent can guess the final conclusion by accident while reaching it through the wrong path, and in production that is almost useless.
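The evaluation order described above (constraints first, then replayed calls, then the final answer) can be illustrated with a minimal Python sketch. This is not VAKRA's actual grader: the `ToolCall` type, the `evaluate_trajectory` function, the toy tools, and the exact-match comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolCall:
    """One step of an agent trajectory (hypothetical structure)."""
    tool: str
    args: dict[str, Any] = field(default_factory=dict)


def evaluate_trajectory(
    calls: list[ToolCall],
    final_answer: Any,
    allowed_tools: set[str],
    environment: dict[str, Callable[..., Any]],
    expected_outputs: list[Any],
    expected_answer: Any,
) -> str:
    # Stage 1: policy check. Any call to a tool outside the allowed set fails.
    if any(call.tool not in allowed_tools for call in calls):
        return "policy_violation"

    # Stage 2: replay the calls and compare intermediate results.
    # Assumption: an exact, ordered match, which is a deliberately strict rule.
    outputs = []
    for call in calls:
        if call.tool not in environment:
            return "wrong_trajectory"
        outputs.append(environment[call.tool](**call.args))
    if outputs != expected_outputs:
        return "wrong_trajectory"

    # Stage 3: only now is the final answer judged.
    if final_answer != expected_answer:
        return "wrong_final_answer"
    return "pass"


# Toy environment with two hypothetical tools.
environment = {
    "lookup_customer": lambda name: 7,
    "get_order_total": lambda customer_id: 200,
}
good_calls = [
    ToolCall("lookup_customer", {"name": "Acme"}),
    ToolCall("get_order_total", {"customer_id": 7}),
]
all_tools = {"lookup_customer", "get_order_total"}

result_good = evaluate_trajectory(good_calls, 200, all_tools, environment, [7, 200], 200)
result_guess = evaluate_trajectory([], 200, all_tools, environment, [7, 200], 200)
result_policy = evaluate_trajectory(good_calls, 200, {"lookup_customer"}, environment, [7, 200], 200)

print(result_good)    # pass
print(result_guess)   # wrong_trajectory
print(result_policy)  # policy_violation
```

The second case is the one the article warns about: the answer is right, but no calls support it, so the trajectory is rejected before the answer is even considered.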

Four types of tasks

The authors divide VAKRA into four modes, each testing a separate layer of agent behavior. Together they cover the path from simple API-chaining to multi-step reasoning over APIs and documents with external constraints. This matters because many agents look confident on single calls but quickly get lost when they need to simultaneously plan steps, switch between sources, maintain dialogue context, and remember tool access rules.

Business Intelligence APIs: 2,077 tasks across 54 domains, where the agent needs to sequentially call 1–12 tools and carefully work with parameters and data filtering.
Dashboard APIs: 1,597 tasks across 17 domains, where the main complexity is selecting the right endpoint among 6–328 available tools.
Multi-hop over APIs: 869 tasks across 38 domains, where the answer is assembled over one to five reasoning hops.
Multi-source + policies: 644 tasks across 41 domains, where the agent alternates between APIs and document search, accounts for dialogue history, and follows textual rules like "use only retriever, don't touch other tools."
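For a quick sense of proportions, the four modes can be put into a small Python table. The task and domain counts come from the list above; the `Mode` class and its field names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    name: str
    tasks: int
    domains: int


# Counts as reported in the article.
MODES = [
    Mode("Business Intelligence APIs", 2077, 54),
    Mode("Dashboard APIs", 1597, 17),
    Mode("Multi-hop over APIs", 869, 38),
    Mode("Multi-source + policies", 644, 41),
]

total_tasks = sum(mode.tasks for mode in MODES)
domain_sum = sum(mode.domains for mode in MODES)

for mode in MODES:
    share = mode.tasks / total_tasks
    print(f"{mode.name:<28} {mode.tasks:>5} tasks  {share:6.1%}")

print(f"Total tasks across the four modes: {total_tasks}")  # 5187
# The per-mode domain counts add up to 150, more than the 62 domains
# reported overall, which suggests that the modes share domains.
print(f"Sum of per-mode domain counts: {domain_sum}")  # 150
```

The two API-chaining modes make up most of the listed tasks, while the multi-source mode with policies is the smallest.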

Where agents fail

The most useful part of the article is the breakdown of where models break. The authors split errors by stage: choosing the wrong tool, omitting required arguments or hallucinating them, setting wrong parameter values, and finally giving an incorrect final answer even after correct calls. On the BI API segment, GPT-OSS-120B performed best: it understood tool schemas noticeably better and made fewer mistakes in names and parameter filling.

But even there, success on individual steps did not guarantee stable end-to-end results. On tasks with a large set of dashboard APIs, Gemini-3-flash-preview performed best, which makes sense, since what matters most there is the ability to shortlist tools and pick the exact endpoint. As reasoning depth grew, quality dropped for all models: 2-hop and especially 3+ hop questions showed noticeably lower accuracy.

It got even worse when APIs had to be combined with document retrieval. The authors specifically note a telling failure: on some 1-hop RAG tasks, GPT-OSS-120B didn't call the retriever at all and tried to answer "from memory," which in such a benchmark counts as an error. Policies added another layer of complexity: models either violated constraints or followed them but failed to gather the information needed for the answer.
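The error stages named in this section (wrong tool, missing or hallucinated arguments, wrong values, wrong final answer) can be sketched as a first-failure classifier. This is a simplified illustration, not the authors' method: the function names and labels are hypothetical, and it assumes each predicted call is compared against exactly one reference call with exact-match arguments.

```python
from typing import Any


def classify_call(
    predicted_tool: str,
    predicted_args: dict[str, Any],
    expected_tool: str,
    expected_args: dict[str, Any],
) -> str:
    """Return the first stage at which a single call goes wrong."""
    if predicted_tool != expected_tool:
        return "wrong_tool"
    if expected_args.keys() - predicted_args.keys():
        return "missing_argument"
    if predicted_args.keys() - expected_args.keys():
        return "hallucinated_argument"
    if any(predicted_args[key] != expected_args[key] for key in expected_args):
        return "wrong_value"
    return "ok"


def classify_task(
    predicted: list[tuple[str, dict[str, Any]]],
    expected: list[tuple[str, dict[str, Any]]],
    final_answer: Any,
    expected_answer: Any,
) -> str:
    """Report the earliest failure in a whole task."""
    if len(predicted) != len(expected):
        return "wrong_number_of_calls"
    for (p_tool, p_args), (e_tool, e_args) in zip(predicted, expected):
        stage = classify_call(p_tool, p_args, e_tool, e_args)
        if stage != "ok":
            return stage
    # All calls matched, so any remaining error is in the final answer.
    if final_answer != expected_answer:
        return "wrong_final_answer"
    return "ok"


# Hypothetical reference call for a toy sales query.
reference = [("get_sales", {"region": "EU", "year": 2024})]

print(classify_task([("get_revenue", {"region": "EU", "year": 2024})], reference, 10, 10))
# wrong_tool
print(classify_task([("get_sales", {"region": "EU"})], reference, 10, 10))
# missing_argument
print(classify_task([("get_sales", {"region": "EU", "year": 2023})], reference, 10, 10))
# wrong_value
print(classify_task(reference, reference, 9, 10))
# wrong_final_answer
```

Reporting only the earliest failure keeps the categories mutually exclusive, which makes per-stage error counts easy to compare across models.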

What this means

VAKRA shows an unpleasant but useful truth about agent systems: the ability to make a pretty demo with tool calling doesn't mean readiness for real business processes. For teams choosing a model for support, analytics, compliance, or internal workflows, the main question is now not "can it call tools," but "does it maintain a correct sequence of actions under constraints, across multiple sources, and without overconfident shortcuts."

Hamidun News

AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
