Raft shows how companies can evaluate AI agents before deploying in workflows

Q: What is the source?

Originally published on Habr AI. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Apr 29, 2026. Reading time: 3 min.


Hamidun News Editorial




Image: Raft shows how companies can evaluate AI agents before deploying in workflows. Source: Habr AI. Collage: Hamidun News.


Raft has published a practical breakdown of how companies can verify the reliability of AI agents before entrusting them with real business processes. The main idea is straightforward: an agent cannot be trusted on the strength of a demonstration or an average success rate. It needs to be run regularly through evals with clear criteria.

Why trust is scarce

As agentic systems move from experiments to production scenarios, businesses face a reasonable question: what to do if an agent makes mistakes, violates rules, or starts behaving strangely. With a human, you can analyze the incident, change incentives, and introduce controls. With AI, this approach doesn't work.

A model has no inherent incentives to behave "correctly," so trust in it cannot be built on feelings, vendor promises, or a single successful pilot. The authors propose treating trust as repeatability of results: if a system reliably produces the expected result whenever it receives similar input data, it can be entrusted with that class of tasks.
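Repeatability can be measured directly: run the agent several times on the same input and count how often the expected result comes back. The following is a minimal sketch, not Raft's implementation. The names `repeatability` and `make_flaky_agent`, the trial count, and the simulated error rates are all assumptions made for illustration.

```python
import random
from typing import Callable


def repeatability(agent: Callable[[str], str], task_input: str,
                  expected: str, trials: int = 20) -> float:
    """Return the share of trials in which the agent produced the expected result."""
    hits = sum(1 for _ in range(trials) if agent(task_input) == expected)
    return hits / trials


def make_flaky_agent(error_rate: float, seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for a real agent: answers wrongly with a fixed probability."""
    rng = random.Random(seed)

    def agent(task_input: str) -> str:
        return "reject" if rng.random() < error_rate else "refund"

    return agent


stable_score = repeatability(make_flaky_agent(0.0), "return within 30 days", "refund")
flaky_score = repeatability(make_flaky_agent(0.5), "return within 30 days", "refund")
print(f"stable agent: {stable_score:.2f}, flaky agent: {flaky_score:.2f}")
```

A score near 1.0 on a class of tasks is the kind of evidence the authors have in mind. What threshold counts as "reliable enough" is a business decision that this sketch does not settle.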

If every action needs manual verification, the value of automation quickly disappears. Therefore, evals here act not as additional analytics, but as a basic mechanism for clearing an agent to work.

How to build an eval set

The starting point is a ground truth set: a collection of real or near-real cases in which input data is linked to the expected outcome. Such a set is usually compiled from historical tasks the team has already processed manually. The article emphasizes that evals, unlike fine-tuning, don't require thousands of examples. What matters more is that each case is unambiguous: two independent experts should reach the same verdict on whether the agent passed the check. A typical eval set consists of several layers:

tasks with specific input data and success criteria
test runs of the agent with final results
one or more graders for different quality aspects
transcript of steps: tool calls, intermediate actions, and routing logic

As an example, Raft describes an e-commerce support agent that handles returns. One case tests a simple return within 30 days, another tests a rejection for a request outside policy, a third tests an ambiguous situation where you can neither automatically refund nor simply reject without clarification. This design shows something important: you need to evaluate not only the final answer, but also the behavior along the way to it.
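The layers listed above and the three return cases might be represented roughly as follows. This is an illustrative sketch under stated assumptions: the class and field names, the `toy_agent` rules, and the use of a missing receipt to stand in for the "ambiguous" case are all invented here and do not come from Raft's article.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One ground-truth case: input data plus an unambiguous expected outcome."""
    task_id: str
    days_since_purchase: int
    has_receipt: bool  # assumption: a missing receipt makes the case ambiguous
    expected_outcome: str  # "refund", "reject", or "escalate"


@dataclass
class Trial:
    """One agent run: the final result plus a transcript of the steps taken."""
    task_id: str
    outcome: str
    transcript: list[str] = field(default_factory=list)


def toy_agent(task: Task) -> Trial:
    """Hypothetical rule-based stand-in for a real agent."""
    steps = ["lookup_order"]
    if not task.has_receipt:
        steps.append("handoff_to_human")
        return Trial(task.task_id, "escalate", steps)
    if task.days_since_purchase <= 30:
        steps.append("issue_refund")
        return Trial(task.task_id, "refund", steps)
    return Trial(task.task_id, "reject", steps)


def outcome_grader(task: Task, trial: Trial) -> bool:
    """Deterministic grader: did the final outcome match the ground truth?"""
    return trial.outcome == task.expected_outcome


eval_set = [
    Task("simple-return", days_since_purchase=10, has_receipt=True, expected_outcome="refund"),
    Task("outside-policy", days_since_purchase=45, has_receipt=True, expected_outcome="reject"),
    Task("ambiguous", days_since_purchase=10, has_receipt=False, expected_outcome="escalate"),
]

results = {task.task_id: outcome_grader(task, toy_agent(task)) for task in eval_set}
print(results)
```

Keeping the transcript on each trial is what later makes it possible to grade the path as well as the final outcome.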

Sometimes the best outcome is not an action but a correct escalation to a human.

For the checks themselves, three approaches can be mixed. Deterministic graders work where precise signals matter, like refund amounts or tool invocations. LLM judges are useful for evaluating tone, completeness, and clarity of response. Humans are needed at the start to gather reference data and calibrate automated evaluators; otherwise the system will quickly start measuring what's convenient rather than what actually matters to the business.
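A rough sketch of how two of these grader types could look in code, plus a calibration check against human labels. Everything here is an assumption for illustration: the transcript format, the function names, and especially `keyword_judge`, which is a trivial placeholder where a real setup would call a language model.

```python
from typing import Callable


def refund_grader(transcript: list[dict], expected_amount: float) -> bool:
    """Deterministic grader: exactly one refund call, with the expected amount."""
    refunds = [step for step in transcript if step["tool"] == "issue_refund"]
    return len(refunds) == 1 and abs(refunds[0]["amount"] - expected_amount) < 0.005


def judge_agreement(judge: Callable[[str], bool],
                    labeled: list[tuple[str, bool]]) -> float:
    """Calibration check: share of replies where the judge matches the human label."""
    matches = sum(1 for reply, human_label in labeled if judge(reply) == human_label)
    return matches / len(labeled)


def keyword_judge(reply: str) -> bool:
    """Hypothetical placeholder for an LLM judge of tone; not a real model call."""
    text = reply.lower()
    return "sorry" in text or "thank" in text


transcript = [
    {"tool": "lookup_order", "order_id": "A-1"},
    {"tool": "issue_refund", "amount": 49.99},
]
# Invented human labels: True means a reviewer rated the tone as acceptable.
human_labeled = [
    ("Sorry for the trouble, your refund is on its way.", True),
    ("Refund denied.", False),
    ("Thanks for waiting. We cannot refund this order.", True),
    ("We have processed your request and hope to see you again.", True),
]

print(refund_grader(transcript, expected_amount=49.99))
agreement = judge_agreement(keyword_judge, human_labeled)
print(f"judge agrees with humans on {agreement:.0%} of replies")
```

Low agreement with the human labels is a signal that the automated judge is measuring something other than what the reviewers care about and needs recalibrating before its verdicts are trusted.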

What metrics to look at

The article also stresses that agentic systems are non-deterministic. Rigidly checking each step therefore makes no sense: the same good result can be reached via different paths. But the path still matters, because it consumes time, tokens, and tool access, and can also violate internal policies.
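One way to grade a path without prescribing it is to check budgets and policy constraints instead of an exact sequence of steps. The sketch below assumes a simple `Step` record and made-up limits (5 steps, 2000 tokens, one forbidden tool named `delete_customer`); none of these values come from the article.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    tokens: int


def path_is_reasonable(steps: list[Step], max_steps: int, max_tokens: int,
                       forbidden_tools: set[str]) -> bool:
    """Check budgets and policy without prescribing the exact sequence of steps."""
    within_steps = len(steps) <= max_steps
    within_tokens = sum(step.tokens for step in steps) <= max_tokens
    policy_ok = all(step.tool not in forbidden_tools for step in steps)
    return within_steps and within_tokens and policy_ok


def grade(outcome: str, expected: str, steps: list[Step]) -> dict[str, bool]:
    """Report result correctness and path reasonableness as separate verdicts."""
    return {
        "result_correct": outcome == expected,
        "path_reasonable": path_is_reasonable(
            steps, max_steps=5, max_tokens=2000, forbidden_tools={"delete_customer"}
        ),
    }


# Two different paths to the same correct outcome: both pass.
path_a = [Step("lookup_order", 300), Step("issue_refund", 200)]
path_b = [Step("lookup_customer", 250), Step("lookup_order", 300), Step("issue_refund", 200)]
# A correct outcome reached through a forbidden tool: fails the path check.
path_c = [Step("delete_customer", 100), Step("issue_refund", 200)]

for name, path in [("a", path_a), ("b", path_b), ("c", path_c)]:
    print(name, grade("refund", "refund", path))
```

Reporting the two verdicts separately keeps a correct answer reached by an unacceptable route from being counted as a clean pass.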

A good eval should answer two questions at once: is the result correct, and was the path to it reasonable? A 95% pass rate sounds great until the errors turn out to be false positives. This is why pass rate alone is insufficient.

For binary decisions, it's useful to look at the confusion matrix, precision, recall, and F1, because different error types cost the business differently. An agent that approves returns too easily creates one category of risk; an agent that rejects legitimate requests en masse creates a completely different one. On top of this, the authors warn about typical pitfalls: Goodhart's law, eval set decay, and the illusion of a "green" dashboard, where the metric looks fine but real user complaints are growing.
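These metrics can be computed from a list of agent decisions and ground-truth labels with the standard definitions. In the sketch below, the 20 cases are invented purely to show how a 95% pass rate can coexist with a wrongly approved return; the data and function names are assumptions, not figures from the article.

```python
def confusion_counts(predicted: list[bool], actual: list[bool]) -> dict[str, int]:
    """Count outcomes for a binary decision, where True means 'approve the return'."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for pred, truth in zip(predicted, actual):
        if pred and truth:
            counts["tp"] += 1
        elif pred and not truth:
            counts["fp"] += 1  # approved a return that should have been rejected
        elif not pred and truth:
            counts["fn"] += 1  # rejected a legitimate return
        else:
            counts["tn"] += 1
    return counts


def precision_recall_f1(counts: dict[str, int]) -> tuple[float, float, float]:
    """Standard definitions, returning 0.0 where a denominator would be zero."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented data: 16 legitimate returns and 4 that should be rejected.
actual = [True] * 16 + [False] * 4
# The agent approves every legitimate return and one illegitimate one.
predicted = [True] * 16 + [True] + [False] * 3

counts = confusion_counts(predicted, actual)
pass_rate = (counts["tp"] + counts["tn"]) / len(actual)
precision, recall, f1 = precision_recall_f1(counts)
print(counts)
print(f"pass rate={pass_rate:.2f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the single error is a false positive, so it shows up in precision while recall stays perfect. Whether that trade is acceptable depends on what a wrongly approved refund costs compared with a wrongly rejected one.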

What it means

For companies that want to deploy AI agents in support, operations, or development, there is one key takeaway: build a verification system first, and only then scale automation. The winning teams aren't those whose agent looks smarter in a demo, but those who understand the cost of its errors, can measure quality against scenarios, and regularly update evals alongside the product.

