Yandex compared MCP and CLI+Skill for AI agents: 400 requests and an unexpected failure
AI-processed from Habr AI; edited by Hamidun News
Yandex's Urban Services team conducted a benchmark comparing two ways to connect an AI agent to internal APIs — and discovered that the architectural choice directly affects token efficiency.
The Problem: Tokens Are Not Infinite
Everyone knows the context window is limited. Few people, however, count how many tokens go not to the task itself but to the "wrapper": tool descriptions, parameter lists, and intermediate call results. In complex scenarios this overhead can consume a significant share of the available context, and the agent then starts making mistakes not because the model is weak but because there is no useful space left.
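The overhead described above can be made concrete with simple arithmetic. The sketch below is only an illustration: the `ContextBudget` class and every number in it are hypothetical and do not come from the Yandex benchmark.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Hypothetical token breakdown for one agent session."""
    window: int                # total context window, in tokens
    tool_descriptions: int     # tool schemas or skill text
    intermediate_results: int  # outputs of earlier tool calls
    system_prompt: int = 0

    @property
    def overhead(self) -> int:
        # Everything that is not the task itself.
        return self.tool_descriptions + self.intermediate_results + self.system_prompt

    @property
    def useful(self) -> int:
        # Tokens left for the actual task; clamped at zero.
        return max(self.window - self.overhead, 0)

    @property
    def overhead_share(self) -> float:
        return self.overhead / self.window


# Invented numbers, chosen only to show the calculation.
budget = ContextBudget(
    window=32_000,
    tool_descriptions=9_000,
    intermediate_results=12_000,
    system_prompt=1_000,
)
print(f"overhead: {budget.overhead} tokens ({budget.overhead_share:.0%} of the window)")
print(f"left for the task: {budget.useful} tokens")
```

Tracking these three buckets separately shows which part of the wrapper is worth shrinking first.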
Daniil Mikhailov from Yandex's partner products team posed the question directly: how can an agent do more while spending fewer tokens when working with real internal APIs?
MCP vs CLI + Skill
The team compared two ways to integrate an agent with tools. MCP (Model Context Protocol) is a structured protocol: the agent receives a description of each tool in an explicit format, and calls go through a standardized layer. The advantage is universality and a predictable schema. The drawback is that the full description of every tool occupies space in the context.
CLI + Skill is an alternative approach: the agent works through the command line, and knowledge about the tools is packed into a "skill", a pre-written prompt instruction. The description is more compact, but it has to be maintained by hand.
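To illustrate why the two approaches can differ in context cost, here is a minimal sketch that compares the size of one MCP-style tool description with a one-line skill entry for the same operation. The tool `get_order_status`, the `orders` CLI, and the four-characters-per-token estimate are all assumptions made up for this example; a real measurement would use the model's own tokenizer and the team's actual tools.

```python
import json


def rough_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


# Hypothetical tool as it might appear in an MCP tool listing:
# a name, a description, and a JSON Schema for the input.
mcp_tool = {
    "name": "get_order_status",
    "description": "Return the current status of an order by its identifier.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order identifier"},
            "verbose": {"type": "boolean", "description": "Include status history"},
        },
        "required": ["order_id"],
    },
}

# Hypothetical skill entry for the same operation, via an imaginary `orders` CLI.
skill_text = "orders status <order_id> [--verbose]  # current status of an order"

mcp_cost = rough_tokens(json.dumps(mcp_tool))
skill_cost = rough_tokens(skill_text)

print(f"MCP-style description: ~{mcp_cost} tokens")
print(f"Skill entry:           ~{skill_cost} tokens")
```

The gap for a single tool is small in absolute terms; it becomes noticeable when dozens of tool descriptions are loaded into every request. The sketch says nothing about accuracy, which is why the benchmark measured both.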
To test the hypothesis drawn from external research, they assembled a benchmark:
- 14 real-world scenarios of working with Yandex's internal tools
- 2 language models
- More than 400 requests
- Measurements of accuracy and token spending in each scenario
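A benchmark of this shape can be summarized with a small aggregation step. The sketch below is an assumption about how such results might be tabulated, not the team's actual harness: the `Run` record, the `summarize` function, and the sample data are hypothetical. For scale, 14 scenarios × 2 models × 2 approaches gives 56 combinations, so more than 400 requests would mean several repeats per combination if runs were spread evenly (the source does not say how they were distributed).

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    scenario: str
    model: str
    approach: str  # "mcp" or "cli_skill"
    correct: bool  # did the agent solve the scenario
    tokens: int    # total tokens spent on the request


def summarize(runs):
    """Group runs by (model, approach); report request count, accuracy, mean tokens."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run.model, run.approach)].append(run)
    return {
        key: {
            "requests": len(items),
            "accuracy": sum(r.correct for r in items) / len(items),
            "mean_tokens": mean(r.tokens for r in items),
        }
        for key, items in groups.items()
    }


# Made-up records purely to exercise the function; these are not the team's data.
runs = [
    Run("s01", "model_a", "mcp", True, 5200),
    Run("s01", "model_a", "mcp", False, 5400),
    Run("s01", "model_a", "cli_skill", True, 3100),
    Run("s01", "model_a", "cli_skill", True, 2900),
]
summary = summarize(runs)
for (model, approach), stats in sorted(summary.items()):
    print(model, approach, stats)
```

Keeping one record per request, rather than only per-scenario averages, makes it possible to go back and inspect individual failures when results change unexpectedly.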
The Moment When Everything Broke
The most valuable finding came not at the end but in the middle of the experiment: a setup that had been working reliably suddenly stopped working. According to Mikhailov, this failure turned out to be more interesting than the final numbers, because the team had to work out why it happened.
"At some point, everything that worked broke, and that turned out to be the most interesting part. I had to figure out why."
Such anomalies in benchmarks often expose hidden dependencies: how the model interprets the schema format, how tools behave under repeated calls, how stable the output is with different task formulations. Without such a "stress moment," the results could have turned out naively optimistic.
Result: A Decision Tree
Based on the series of experiments, the team compiled a practical decision tree showing when MCP is the better choice and when CLI + Skill is. This is not an abstract recommendation but a conclusion drawn from real data: more than 400 requests against real infrastructure.
What This Means
Choosing how to connect an agent to an API is not a matter of technical taste. It affects how many tokens are wasted, how long the context lasts, and how stably the agent behaves in non-standard scenarios. For teams building product agents on top of internal systems, this research offers a concrete tool for choosing an architecture based on real measurements rather than marketing claims.