
Hugging Face launched Open Agent Leaderboard to evaluate AI agents


Source: Hugging Face Blog. Collage: Hamidun News.

Hugging Face and IBM Research introduced Open Agent Leaderboard, the first open benchmark for evaluating complete agentic systems rather than individual models. Their analysis shows that an agent's performance depends not only on the underlying model but also on how the agent around it is built.

What the benchmark tests

The open benchmark includes six different sets of tasks:

  • Fixing real bugs in code repositories (SWE-Bench Verified)
  • Complex web search and information gathering (BrowseComp+)
  • Executing personal tasks across hundreds of applications (AppWorld)
  • Airline customer support (tau2-Bench)
  • Retail customer support (tau2-Bench)
  • Technical support with company policy compliance (Telecom)

All tests run on a unified protocol: identical task structure, context, and available tools. This allows agents to be compared fairly without requiring them to be adapted for each benchmark.
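To make the idea of a unified protocol concrete, here is a minimal Python sketch of what a shared task shape and evaluation loop could look like. It is not the leaderboard's actual interface: the `Task` fields, the `Agent` protocol, the `evaluate` function, and the toy `HintAgent` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


@dataclass(frozen=True)
class Task:
    """One benchmark item in a shared shape (hypothetical schema)."""
    benchmark: str                    # name of the task set the item comes from
    instruction: str                  # what the agent is asked to do
    check: Callable[[str], bool]      # benchmark-specific scorer for the answer
    context: dict = field(default_factory=dict)  # extra material given to the agent
    tools: tuple[str, ...] = ()       # names of tools the agent may call


class Agent(Protocol):
    """Anything with a run() method can be evaluated, whatever its internals."""
    def run(self, task: Task) -> str: ...


def evaluate(agent: Agent, tasks: list[Task]) -> dict[str, float]:
    """Return the success rate per benchmark for one agent."""
    totals: dict[str, int] = {}
    passed: dict[str, int] = {}
    for task in tasks:
        totals[task.benchmark] = totals.get(task.benchmark, 0) + 1
        if task.check(agent.run(task)):
            passed[task.benchmark] = passed.get(task.benchmark, 0) + 1
    return {name: passed.get(name, 0) / count for name, count in totals.items()}


class HintAgent:
    """Toy agent (assumption for the demo): answers with the hint in the context."""
    def run(self, task: Task) -> str:
        return task.context.get("hint", "")
```

Because every task exposes the same fields, the same `evaluate` call works for any agent and any task set without per-benchmark adapters, which is the property the article attributes to the unified protocol.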

Key finding: agent architecture matters more than the model

The analysis revealed an unexpected result: the same model embedded in different agent architectures can produce very different results, both in quality and in execution cost. Moreover, failed attempts cost 20–54% more than successful ones because of repeated requests to the model. General-purpose agents proved competitive with specialized systems developed for specific tasks, which matters because specialized agents are harder to deploy in the real world.

"Today, model choice explains most of the results. But agent architecture is already beginning to change the outcome," the researchers conclude.
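The comparison described in this section, one model under several architectures with quality and cost tracked side by side, can be sketched in a few lines of Python. The run records below are invented for illustration and are not leaderboard data; the `summarize` function and the `failure_premium` metric (how much more a failed run costs on average than a successful one) are assumptions about how such a comparison might be computed.

```python
from collections import defaultdict
from statistics import mean

# Invented run records: (model, architecture, succeeded, cost in USD).
RUNS = [
    ("model-x", "arch-a", True, 0.40),
    ("model-x", "arch-a", True, 0.60),
    ("model-x", "arch-a", False, 0.65),
    ("model-x", "arch-b", True, 1.00),
    ("model-x", "arch-b", False, 1.20),
    ("model-x", "arch-b", False, 1.30),
]


def summarize(runs):
    """Group runs by (model, architecture) and report quality and cost."""
    groups = defaultdict(list)
    for model, arch, ok, cost in runs:
        groups[(model, arch)].append((ok, cost))

    summary = {}
    for key, rows in groups.items():
        wins = [cost for ok, cost in rows if ok]
        losses = [cost for ok, cost in rows if not ok]
        # Relative extra cost of a failed run; undefined without both kinds of run.
        premium = mean(losses) / mean(wins) - 1.0 if wins and losses else None
        summary[key] = {
            "success_rate": len(wins) / len(rows),
            "mean_cost": mean(cost for _, cost in rows),
            "failure_premium": premium,
        }
    return summary


if __name__ == "__main__":
    for (model, arch), stats in summarize(RUNS).items():
        print(model, arch, stats)
```

Holding the model fixed and varying only the architecture key is what lets a table like this separate the effect of the agent design from the effect of the model.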

What's currently available to the community

Hugging Face released several resources for developers.

Open Agent Leaderboard: an interactive table with results from all tests.

Exgentic: an open platform for running and reproducing evaluations, which lets other researchers add their own agents and benchmarks. Among the first additions are two open-weight models, DeepSeek V3.2 and Kimi K2.5. They showed competitive results in individual combinations but still lag behind closed models by 18–29% on average.

What this means

An open benchmark for agents is a step toward standardizing evaluation. As AI agents evolve, their architecture (planning, memory management, tool use, error recovery) becomes as important as model selection. The leaderboard makes these differences visible and enables the community to build better systems together.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.