
How to stop guessing and start measuring the quality of AI agents

The Bitrix24 team explained how, over six months, it moved from manually testing its AI agent Martha to an automated benchmarking system. The problem is familiar

AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

Every team that has shipped an AI agent to production eventually faces the same nightmare. A user writes to support: "your bot is talking nonsense." A developer opens the logs, looks at the prompt, looks at the response — and can't figure out what exactly went wrong. Is it a regression after yesterday's commit? A side effect of switching models? Or just an unlucky edge case that's always existed? The Bitrix24 team went through every stage of this process with their AI agent Martha — and now they're sharing the lessons that everyone working with large language models in production should learn.

Martha is an AI assistant inside the Bitrix24 ecosystem that interacts with CRM, manages tasks, and answers user questions. In the early stages, everything looked simple: open a chat, ask a question, look at the answer with your own eyes. Classic manual testing, which works while the agent can do ten things. But as soon as Martha's functionality grew, this approach started to crack. One engineer is physically unable to run two hundred scenarios after every prompt revision. And prompt revisions in modern AI product development are not the exception — they're daily routine.

The problem the team describes is systemic. Prompt engineering is inherently unstable: even a slight change in instruction wording can unpredictably affect model behavior across dozens of different contexts. Add to that periodic model version changes from providers, updates to system prompts, and a growing set of tools available to the agent, and you get a combinatorial explosion of potential failure points. Without automated quality control, the team effectively works blind, reacting to problems after the fact instead of preventing them.

The solution Bitrix24 arrived at was building a full-fledged benchmarking system. The essence of the approach is to formalize expectations for the AI agent as a set of test scenarios with measurable success criteria. These aren't unit tests in the classical sense: language model responses are non-deterministic, and checking them for exact matches is pointless. Instead, metrics are used that evaluate relevance, completeness, correctness of tool invocation, and alignment with communication tone. In essence, the team is building an automated analog of expert evaluation that can be run after every change.
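To make the idea of "test scenarios with measurable success criteria" concrete, here is a minimal sketch in Python. It is not Bitrix24's implementation: the `Scenario` and `AgentReply` shapes, the tool name `tasks.create`, the 0.8 threshold, and the stub agent are all assumptions made for illustration. It covers only two of the metrics named above, tool invocation and completeness. Relevance and tone usually need a human or model-based judge and are left out. Because responses are non-deterministic, each scenario is run several times and the result is a pass rate rather than a single pass or fail.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentReply:
    """One agent response (hypothetical shape): text plus the tools it called."""
    text: str
    tool_calls: list[str] = field(default_factory=list)


@dataclass
class Scenario:
    """One benchmark case: a prompt plus measurable success criteria."""
    name: str
    prompt: str
    expected_tools: list[str]   # tools the agent should call, in order
    required_facts: list[str]   # phrases a complete answer must contain


def tool_score(reply: AgentReply, scenario: Scenario) -> float:
    """1.0 if the agent called exactly the expected tools in order, else 0.0."""
    return 1.0 if reply.tool_calls == scenario.expected_tools else 0.0


def completeness_score(reply: AgentReply, scenario: Scenario) -> float:
    """Share of required phrases found in the reply (case-insensitive)."""
    if not scenario.required_facts:
        return 1.0
    text = reply.text.lower()
    hits = sum(1 for fact in scenario.required_facts if fact.lower() in text)
    return hits / len(scenario.required_facts)


def evaluate(agent: Callable[[str], AgentReply], scenario: Scenario,
             runs: int = 5, threshold: float = 0.8) -> float:
    """Run the scenario `runs` times and return the share of passing runs.

    A run passes when its weakest metric reaches `threshold`.
    """
    passed = 0
    for _ in range(runs):
        reply = agent(scenario.prompt)
        score = min(tool_score(reply, scenario),
                    completeness_score(reply, scenario))
        if score >= threshold:
            passed += 1
    return passed / runs


def stub_agent(prompt: str) -> AgentReply:
    """Stand-in for a real agent call; always returns the same reply."""
    return AgentReply(text="The task was created.", tool_calls=["tasks.create"])


scenario = Scenario(
    name="create_task",
    prompt="Create a task to call the client tomorrow",
    expected_tools=["tasks.create"],
    required_facts=["task", "created"],
)
pass_rate = evaluate(stub_agent, scenario)
```

Taking the minimum across metrics is a deliberately strict choice: a well-worded answer that skipped the required tool call still counts as a failure. A weighted average would be a softer alternative.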

The path from idea to a working system took about half a year, an honest figure that says a lot about how immature the tooling in this area still is. Off-the-shelf solutions that can be plugged into an arbitrary AI agent practically don't exist. Each team has to work out for itself which metrics reflect the quality of its specific product, how to generate test datasets and keep them up to date, how to interpret results, and how to integrate benchmarks into the CI/CD pipeline. Bitrix24 emphasizes that their approach is not tied to a specific tech stack, which is perhaps the most valuable part of their experience.
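One way to wire a benchmark into a CI/CD pipeline is a regression gate: compare per-scenario pass rates from the current run against a stored baseline and fail the build if any scenario drops by more than a tolerance. The sketch below is an assumption about how such a gate might look, not a description of the Bitrix24 pipeline. The function names, the scenario names, and the 0.05 tolerance are hypothetical.

```python
def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     tolerance: float = 0.05) -> list[str]:
    """List scenarios whose pass rate fell more than `tolerance` below baseline."""
    regressions = []
    for name, old_rate in baseline.items():
        # A scenario missing from the current run is treated as fully failing.
        new_rate = current.get(name, 0.0)
        if new_rate < old_rate - tolerance:
            regressions.append(f"{name}: {old_rate:.2f} -> {new_rate:.2f}")
    return regressions


def gate(baseline: dict[str, float], current: dict[str, float],
         tolerance: float = 0.05) -> int:
    """Print regressions and return a process exit code (0 = ok, 1 = fail)."""
    regressions = find_regressions(baseline, current, tolerance)
    for line in regressions:
        print("REGRESSION", line)
    return 1 if regressions else 0


# Hypothetical pass rates before and after a prompt change.
baseline_rates = {"create_task": 0.9, "find_deal": 1.0}
current_rates = {"create_task": 0.7, "find_deal": 1.0}
exit_code = gate(baseline_rates, current_rates)
```

In a real pipeline the exit code would be passed to `sys.exit` so the CI job fails. The tolerance exists because pass rates of a non-deterministic agent fluctuate between runs even when nothing has changed.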

Martha's story reflects a broader trend in the industry. As AI agents move from experiments to business-critical tools, reliability requirements rise sharply. Companies like Anthropic, OpenAI, and Google invest in model evaluation at the platform level, but for specific products, responsibility for quality still rests with the development teams. The problem is compounded by the fact that users quickly lose trust in an AI assistant after a few failed responses, and regaining that trust is significantly harder than losing it.

The cultural shift behind this transition deserves separate attention. Manual testing of AI agents is not just inefficient; it creates a false sense of control. An engineer who has tested twenty scenarios out of two hundred tends to think the system works correctly, when in fact they have covered only ten percent of the scenarios. Automated benchmarks don't eliminate uncertainty completely, but they make it visible and measurable. And what can be measured can be improved.

Bitrix24's experience is a signal for the entire Russian-language AI development industry. The era when an AI agent could be shipped to production with the words "seems to work" is coming to an end. Ahead lies an era of metrics, benchmarks, and continuous quality control. And those teams that master these practices first will gain a decisive advantage in the fight for user trust.

