How to Measure an AI Agent's Performance in QA: The Story of a Benchmark
Mikhail Fedorov built an objective benchmark for QA Assist, a system of 11 AI agents that automate testing. Instead of subjective assessments of agent performan

When an AI agent works on testing tasks, one question becomes pressing: is it really better than yesterday? Answering it takes numbers, not impressions. Mikhail Fedorov, developer of QA Assist, faced this problem head-on. QA Assist is a system of 11 AI agents that together cover the entire testing cycle, from requirement decomposition to ready-made automated tests. But how do you tell whether the system improved after the latest update? Judging by eye is unreliable.
Why Judging by Eye Doesn't Work
Subjective evaluation can be misleading. If the agent found 5 bugs yesterday and 7 today, can you be sure the system actually improved, or did the test set simply change? Different model versions, different prompts, and different LLM temperature settings all affect the result. Without a systematic benchmark, it is hard to tell which change actually helps. Fedorov took a radical approach: he created a separate benchmark project where the agent works by the same rules, on the same requirements, with the same edge cases.
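One way to keep such comparisons honest is to record the settings of each run and check which of them differ between two runs. The sketch below is only an illustration of that idea, not the benchmark's actual code: the `RunConfig` fields, the `changed_fields` helper, and all values are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConfig:
    """Settings that influence an agent run (field names are assumptions)."""
    model: str
    prompt_version: str
    temperature: float
    dataset_version: str


def changed_fields(a: RunConfig, b: RunConfig) -> list[str]:
    """List the names of the settings that differ between two runs."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Hypothetical runs: only the prompt changes, everything else is held fixed.
baseline = RunConfig("model-a", "v1", 0.0, "2024-01")
candidate = RunConfig("model-a", "v2", 0.0, "2024-01")

diff = changed_fields(baseline, candidate)
print(diff)  # ['prompt_version']
```

If more than one setting differs, a change in the score cannot be attributed to any single one of them.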
What the Benchmark Can Do
- Compare different agent versions on the same dataset
- Test the impact of individual pipeline improvements (prompt engineering, changes to decomposition logic)
- Experiment with models: GPT-5.5 vs Claude vs others
- Track progress over time with visualization of improvements
- Generate a complete report on the percentage of bugs found, misses, and false positives
Important: a benchmark does not mean "an ideal test set." It means a controlled test set where variables are minimized and each run is reproducible.
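The report metrics listed above (bugs found, misses, false positives) can be computed by comparing the agent's reports with a fixed set of known bugs. The following is a minimal sketch under that assumption; the bug IDs, the `RunScore` class, and the `score_run` function are hypothetical, and the article does not describe how QA Assist actually scores runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunScore:
    found: int            # known bugs the agent reported
    missed: int           # known bugs the agent did not report
    false_positives: int  # reports that match no known bug

    @property
    def detection_rate(self) -> float:
        """Share of known bugs that the agent found."""
        total = self.found + self.missed
        return self.found / total if total else 0.0


def score_run(known_bugs: set[str], reported: set[str]) -> RunScore:
    """Compare reported bug IDs with the fixed set of known bugs."""
    return RunScore(
        found=len(known_bugs & reported),
        missed=len(known_bugs - reported),
        false_positives=len(reported - known_bugs),
    )


# Hypothetical data: the same known-bug set is reused for both agent versions.
KNOWN = {"BUG-1", "BUG-2", "BUG-3", "BUG-4"}
baseline = score_run(KNOWN, {"BUG-1", "BUG-2", "BUG-9"})
candidate = score_run(KNOWN, {"BUG-1", "BUG-2", "BUG-3"})

for name, score in (("baseline", baseline), ("candidate", candidate)):
    print(f"{name}: {score.detection_rate:.0%} found, "
          f"{score.missed} missed, {score.false_positives} false positives")
```

Because both versions are scored against the same `KNOWN` set, a higher detection rate reflects a change in the agent rather than a change in the test set.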
Artifacts in a Single Run
With each execution, the agent prepares a complete package: documented requirements and their decomposition, test scenarios with steps, ready-made automated test code, a coverage and miss report, and a log of accepted and rejected decisions. All artifacts are stored in a public repository, so you can see how the agent reasons on different examples. This is useful not only for tracking progress but also for debugging: when the agent makes a mistake, you can see at which step of the pipeline it happened and why.
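To illustrate how a decision log can support this kind of debugging, here is a small sketch that looks up where a requirement was rejected. The log format is an assumption: the `Decision` fields, step names, and requirement IDs are hypothetical and are not taken from the QA Assist repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    step: str         # pipeline step that made the decision (hypothetical names)
    requirement: str  # ID of the requirement the decision concerns
    accepted: bool
    reason: str


def first_rejection(log: list[Decision], requirement: str) -> Optional[Decision]:
    """Return the earliest logged rejection for a requirement, or None."""
    for decision in log:
        if decision.requirement == requirement and not decision.accepted:
            return decision
    return None


# Hypothetical log entries, in the order the pipeline produced them.
log = [
    Decision("decomposition", "REQ-7", True, "split into two checks"),
    Decision("scenario_generation", "REQ-7", False, "judged duplicate of REQ-3"),
    Decision("test_generation", "REQ-3", True, "test emitted"),
]

hit = first_rejection(log, "REQ-7")
if hit is not None:
    print(f"REQ-7 dropped at {hit.step}: {hit.reason}")
```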
What This Means
For QA tool developers, benchmarks become mandatory: they are the only way to be honest with yourself about the quality of your work. Fedorov's openly available project demonstrates that such transparency is possible. Other teams working with AI agents in testing now have an example of what to set up from the start.