Habr AI: Why Agent Systems Need New Control and Safety Metrics
When an LLM shifts from chatbot to agent, response quality assessment alone is no longer sufficient. Critical metrics include task completion, plan quality…
AI-processed from Habr AI; edited by Hamidun News
The transition from chatbots to agentic systems fundamentally changes what needs to be controlled. It was once enough to know how useful and correct a model's response to a query was; now the entire chain of actions the system builds for itself has to be evaluated. An agent does not simply generate text: it plans steps, selects tools, requests data, makes intermediate decisions, and can delegate work to other agents. In such an architecture, a polished final answer no longer guarantees that the system performed reliably, safely, and cost-effectively.
For a classical chatbot, the main metrics were typically response quality, accuracy of phrasing, relevance, and user satisfaction. For an agent, this is insufficient, because an error can appear long before the final message. The system can incorrectly decompose a task into stages, select an unsuitable tool, terminate the scenario prematurely, get stuck in repetitive actions, or conversely take unnecessary steps and spend too many tokens, time, and external requests.
The focus therefore widens from the result alone to the trajectory by which the agent arrived at it, and the set of metrics widens with it. First comes the proportion of successfully completed tasks: not merely whether the agent gave a plausible answer, but whether it achieved the user's goal without manual intervention.
Next are planning quality indicators: how logically the steps were chosen, how many of them were actually necessary, and how often the plan had to be revised during execution. Tool invocation correctness is measured separately: did the agent select the right API, pass valid parameters, obtain the expected result, and handle any error adequately? Multi-agent systems add coordination metrics: do agents duplicate each other's work, lose context, or take conflicting actions?
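As a rough illustration of how the outcome, planning, and tool-call indicators above could be computed from logged runs, here is a minimal Python sketch. The `Episode` and `ToolCall` records and all of their field names are hypothetical, and the sketch assumes that a test fixture or a reviewer has already labeled the expected tool and whether the goal was achieved.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str            # tool the agent actually invoked
    expected_tool: str   # tool a reviewer or test fixture marks as correct (assumed label)
    params_valid: bool   # did the arguments pass validation?
    succeeded: bool      # did the call return the expected result?


@dataclass
class Episode:
    goal_achieved: bool      # user's goal met, as judged by a test or reviewer
    human_intervened: bool   # a person had to step in during the run
    plan_revisions: int      # how many times the plan changed mid-run
    tool_calls: list[ToolCall] = field(default_factory=list)


def trajectory_metrics(episodes: list[Episode]) -> dict[str, float]:
    """Aggregate completion, tool-call, and planning indicators over episodes."""
    if not episodes:
        raise ValueError("no episodes to evaluate")
    calls = [c for e in episodes for c in e.tool_calls]
    # A task counts as completed only if the goal was met without manual help.
    completed = sum(e.goal_achieved and not e.human_intervened for e in episodes)
    # A call counts as correct only if tool, parameters, and result are all right.
    correct = sum(
        c.tool == c.expected_tool and c.params_valid and c.succeeded for c in calls
    )
    return {
        "task_completion_rate": completed / len(episodes),
        "tool_call_correctness": correct / len(calls) if calls else 0.0,
        "avg_plan_revisions": sum(e.plan_revisions for e in episodes) / len(episodes),
    }
```

Counting an episode as completed only when no human intervened mirrors the stricter definition in the text; a team could just as reasonably report assisted completions as a separate number.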
Equally important are cost and observability. Agentic systems are almost always more expensive than regular dialogue, because every additional step, model call, or external service request has a cost. Control must therefore account for average iterations per task, token consumption, frequency of retries, execution duration, and the proportion of meaningless actions.
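The cost indicators listed above reduce to simple aggregates over per-run records. The sketch below assumes a hypothetical `RunCost` record; how a step gets flagged as wasted, and the choice to express retries and wasted steps as a share of all iterations, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunCost:
    iterations: int     # agent loop iterations used for the task
    tokens: int         # total tokens consumed
    retries: int        # repeated calls after a failure
    duration_s: float   # wall-clock execution time in seconds
    wasted_steps: int   # steps flagged as not contributing to the result (assumed label)


def cost_summary(runs: list[RunCost]) -> dict[str, float]:
    """Average per-task cost figures plus retry and waste shares."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    total_steps = sum(r.iterations for r in runs)
    return {
        "avg_iterations": total_steps / n,
        "avg_tokens": sum(r.tokens for r in runs) / n,
        "avg_duration_s": sum(r.duration_s for r in runs) / n,
        # Shares are taken over all iterations, so long runs weigh more.
        "retry_rate": sum(r.retries for r in runs) / total_steps if total_steps else 0.0,
        "wasted_step_share": (
            sum(r.wasted_steps for r in runs) / total_steps if total_steps else 0.0
        ),
    }
```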
In parallel, tracing requirements grow: the team needs to see what decision the agent made at each stage, what data it relied on, why it selected a particular tool, and at what point it deviated from the expected scenario. Without such transparency, it is impossible to debug behavior, investigate failures, or prove compliance with internal policies.
Security requirements shift as well. Where a chatbot mainly risked producing incorrect or dangerous text, an agent can perform an undesirable action: send a request to the wrong place, gain unintended access to data, modify a record in a system, or use a tool outside its permitted context. Agentic architecture therefore requires granular access control, sandboxing for tools, strict policies on action execution, limits on autonomy, and mechanisms that stop the system when it exhibits suspicious behavior. Security here ceases to be a filter at the input and output and becomes part of the operational loop.
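One way to picture security inside the operational loop is a guard that every action must pass before it runs. The sketch below is an assumption-laden illustration, not a sandbox: `ActionGuard`, its scope names, and the default step limit are hypothetical. It combines a per-tool allowlist, a limit on autonomy, a stop via exception, and a trace entry for each decision.

```python
from dataclasses import dataclass, field


class PolicyViolation(Exception):
    """Raised to stop the agent when an action breaks policy."""


@dataclass
class ActionGuard:
    allowed_tools: dict[str, set[str]]   # tool name -> permitted scopes (hypothetical)
    max_steps: int = 20                  # autonomy limit per task (arbitrary default)
    steps: int = 0
    trace: list[dict] = field(default_factory=list)

    def authorize(self, tool: str, scope: str, reason: str) -> None:
        """Check one proposed action, record the decision, and stop on violation."""
        self.steps += 1
        if self.steps > self.max_steps:
            verdict = "step limit exceeded"
        elif scope not in self.allowed_tools.get(tool, set()):
            verdict = "tool or scope not permitted"
        else:
            verdict = "allowed"
        # Every decision is logged, including the agent's stated reason,
        # so a failure can later be reconstructed from the trace.
        self.trace.append(
            {"step": self.steps, "tool": tool, "scope": scope,
             "reason": reason, "verdict": verdict}
        )
        if verdict != "allowed":
            raise PolicyViolation(f"step {self.steps}: {verdict} ({tool}/{scope})")
```

In use, the agent loop would call `guard.authorize("crm", "read", "look up customer")` before executing the tool and treat `PolicyViolation` as a signal to halt or hand over to a human. Real isolation of the tool itself would still have to come from the execution environment.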
Another shift concerns operations. For an agentic system, it is important not only to execute a task in an ideal scenario, but also to degrade safely when something fails. Recovery metrics become useful: how often the agent can correct its own error, when it hands a task over to a human, how many incidents require manual investigation, and how quickly the team can reproduce the problem from logs.
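The recovery questions above can be tracked from a log of failures. The sketch below assumes a hypothetical `Failure` record filled in during incident review; using the median for time to reproduce is an illustrative choice, not something the source prescribes.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Failure:
    self_corrected: bool                   # agent fixed its own error
    escalated_to_human: bool               # task was handed over to a person
    needed_investigation: bool             # incident required manual investigation
    minutes_to_reproduce: Optional[float]  # None if never reproduced from logs


def recovery_metrics(failures: list[Failure]) -> dict[str, Optional[float]]:
    """Summarize how failures were handled and how fast they were reproduced."""
    if not failures:
        raise ValueError("no failures recorded")
    n = len(failures)
    reproduced = [
        f.minutes_to_reproduce for f in failures if f.minutes_to_reproduce is not None
    ]
    return {
        "self_correction_rate": sum(f.self_corrected for f in failures) / n,
        "escalation_rate": sum(f.escalated_to_human for f in failures) / n,
        "investigation_rate": sum(f.needed_investigation for f in failures) / n,
        "median_minutes_to_reproduce": median(reproduced) if reproduced else None,
    }
```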
In practice, this means that product and platform teams need to design not only the agent's intelligence, but also its failure modes, monitoring, and intervention procedures. The main conclusion is that agentic systems cannot be evaluated by the same rules as ordinary chat interfaces. Companies must transition from checking response quality to full execution engineering: measuring task completion, plan robustness, tool invocation correctness, cost, traceability, and adherence to security rules.
The more autonomous an LLM becomes, the more its control resembles monitoring a complex software service rather than judging individual responses as successes or failures.