Claude Code and 11 Agents: How a QA Team Automated Up to 80% of Testing Routine

Instead of hiring two more testers, the QA team built a system of 11 agents based on Claude Code. It analyzes Jira and Confluence, builds test scenarios following ISTQB, writes API and UI automated tests, and opens Merge Requests independently. In the pilot, over a month and a half, regression testing grew from 50 to 400 test cases, and feature verification time was reduced from days to hours.

Khamidun Zhemal

AI monitoring · Habr AI

Apr 28, 2026· 3 min

AI-processed from Habr AI; edited by Hamidun News

Image: collage by Hamidun News. Source: Habr AI.


A QA team has shown that AI agents can already take on the bulk of routine testing work: from reading requirements and designing test scenarios to uploading test cases into a test management system (TMS) and creating Merge Requests with ready-made autotests. Instead of adding two more testers to the staff, the team built an "exoskeleton" of 11 specialized agents based on Claude Code and claims it covers up to 80% of a QA engineer's operational work. They started from a problem typical of product teams: development ships new features faster than testing can turn requirements into scenarios, data, and automation.

By the team's internal estimate, about 20% of QA time goes to requirements analysis and test design, another 15% to creating cases in the TMS, 10% to data preparation, and 25% to automation, followed by the checks themselves, regression, and reports. Those four named activities alone account for 70%; in total, the team estimates that up to 80% of the workload can be described by rules and broken down into repeatable stages. Hence the idea: not to replace the engineer, but to turn them into a pipeline operator who sets the task, reviews the artifacts, and intervenes only where the system lacks context or faces an ambiguous decision.

The architecture is built as a modular chain of 11 skills and a separate orchestrator. One agent pulls the task from Jira and related materials from Confluence, another decomposes requirements into User Stories and tasks, and a third generates test scenarios according to ISTQB rules. Further agents handle missing data, discrepancies, DOM selectors, API and UI autotests, comparison of scenarios with code, upload of cases to Zephyr Scale, and creation of Merge Requests in GitLab. JSON scenarios with full traceability to requirements and acceptance criteria serve as the single source of truth, while a requirements traceability matrix (RTM) is built in parallel.
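To make the "single source of truth" idea more concrete, here is a minimal sketch of how JSON scenarios with traceability links could be turned into an RTM. The source does not publish the team's actual schema, so the field names (`id`, `requirement_ids`, `acceptance_criteria`, `steps`), the sample IDs, and the `build_rtm` helper are all assumptions made for illustration.

```python
import json

# Hypothetical scenario format; the team's real JSON schema is not published.
SCENARIOS_JSON = """
[
  {"id": "TC-1", "title": "Login with valid credentials",
   "requirement_ids": ["REQ-1"], "acceptance_criteria": ["AC-1.1"],
   "steps": ["Open login page", "Submit valid credentials"]},
  {"id": "TC-2", "title": "Login with wrong password",
   "requirement_ids": ["REQ-1", "REQ-2"], "acceptance_criteria": ["AC-1.2"],
   "steps": ["Open login page", "Submit wrong password"]}
]
"""


def build_rtm(scenarios, requirement_ids):
    """Map every known requirement to the IDs of scenarios that trace to it."""
    rtm = {req: [] for req in requirement_ids}
    for scenario in scenarios:
        for req in scenario["requirement_ids"]:
            # setdefault also keeps links to requirements missing from the list
            rtm.setdefault(req, []).append(scenario["id"])
    return rtm


scenarios = json.loads(SCENARIOS_JSON)
rtm = build_rtm(scenarios, ["REQ-1", "REQ-2", "REQ-3"])

# Requirements with no linked scenario are coverage gaps.
uncovered = [req for req, cases in rtm.items() if not cases]

print(rtm)
print("Uncovered requirements:", uncovered)
```

In this sketch a requirement with an empty list of scenarios is a coverage gap, which is the kind of signal a pipeline operator would want surfaced before any autotests are generated.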

For the frontend part, the system additionally accesses Figma through MCP and extracts not just screenshots, but the structure of the interface, element states, and constraints. Special emphasis was placed on quality and on protection against typical LLM weak points. After scenario generation, the orchestrator runs quality control checkpoints: it checks the JSON schema, the completeness of steps, priorities, the absence of duplicates, and requirements coverage.
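The checkpoint list above could be implemented roughly as follows. This is a sketch under stated assumptions, not the team's orchestrator: the required fields, the priority scale, the title-based duplicate rule, and the `check_scenarios` function are all hypothetical.

```python
# Assumed conventions; the source does not specify the real schema or priority scale.
REQUIRED_FIELDS = {"id", "title", "priority", "steps", "requirement_ids"}
VALID_PRIORITIES = {"high", "medium", "low"}


def check_scenarios(scenarios, requirement_ids):
    """Return a list of problems; an empty list means the checkpoint passed."""
    problems = []
    seen_titles = set()
    covered = set()
    for index, scenario in enumerate(scenarios):
        label = scenario.get("id", f"#{index}")
        # Schema check: every required field must be present.
        missing = REQUIRED_FIELDS - scenario.keys()
        if missing:
            problems.append(f"{label}: missing fields {sorted(missing)}")
            continue
        # Completeness of steps.
        if not scenario["steps"]:
            problems.append(f"{label}: no steps")
        # Priority must come from the agreed scale.
        if scenario["priority"] not in VALID_PRIORITIES:
            problems.append(f"{label}: unknown priority {scenario['priority']!r}")
        # Duplicate detection by normalized title (a deliberately simple rule).
        key = scenario["title"].strip().lower()
        if key in seen_titles:
            problems.append(f"{label}: duplicate title")
        seen_titles.add(key)
        covered.update(scenario["requirement_ids"])
    # Requirements coverage: every requirement needs at least one scenario.
    for req in sorted(set(requirement_ids) - covered):
        problems.append(f"{req}: not covered by any scenario")
    return problems


scenarios = [
    {"id": "TC-1", "title": "Login works", "priority": "high",
     "steps": ["Open page", "Log in"], "requirement_ids": ["REQ-1"]},
    {"id": "TC-2", "title": "login works ", "priority": "urgent",
     "steps": [], "requirement_ids": ["REQ-1"]},
]
problems = check_scenarios(scenarios, ["REQ-1", "REQ-2"])
for problem in problems:
    print(problem)
```

Returning a list of problems rather than raising on the first one lets the orchestrator send the whole batch of findings back to the generating agent in a single pass.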

After autotest generation, control becomes even stricter: the Python code, the fixtures, and the actual test runs are all validated. A two-stage debugging scheme is used. First, the system runs each test separately and separates test code issues from real product defects.

Then mutation testing kicks in: in a test that already passes, the assert is inverted, and if the test still passes, it is considered empty and sent back for refinement. Another important layer is the conflict protocol between Jira, Figma, the requirements text, and actual interface behavior. Obvious conflicts are resolved automatically according to a priority hierarchy of sources, while disputed cases are escalated to the engineer.
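The assert-inversion check from the debugging scheme can be illustrated with Python's standard `ast` module. This is a minimal sketch, not the team's implementation: the `InvertAsserts`, `passes`, and `is_empty_test` names and both sample tests are hypothetical, and a real pipeline would run generated code in a sandbox rather than through a bare `exec`.

```python
import ast
import textwrap


class InvertAsserts(ast.NodeTransformer):
    """Rewrite every `assert cond` into `assert not (cond)`."""

    def visit_Assert(self, node):
        node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
        return node


def passes(source, test_name, invert=False):
    """Run one test function from source; True if no assertion fails."""
    tree = ast.parse(textwrap.dedent(source))
    if invert:
        tree = ast.fix_missing_locations(InvertAsserts().visit(tree))
    namespace = {}
    exec(compile(tree, "<generated-test>", "exec"), namespace)
    try:
        namespace[test_name]()
    except AssertionError:
        return False
    return True


def is_empty_test(source, test_name):
    """A test that passes both as written and with inverted asserts checks nothing."""
    return passes(source, test_name) and passes(source, test_name, invert=True)


REAL_TEST = """
def test_status():
    response = {"status": 200}
    assert response["status"] == 200
"""

EMPTY_TEST = """
def test_items():
    items = []  # nothing to iterate over, so the assert never runs
    for item in items:
        assert item["price"] > 0
"""

print(is_empty_test(REAL_TEST, "test_status"))
print(is_empty_test(EMPTY_TEST, "test_items"))
```

The second sample shows the typical failure this catches: the test is green only because its assertion is never reached, so inverting the assertion changes nothing.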

In practice, the one-and-a-half-month pilot delivered the kind of numbers that justify launching such experiments. The number of regression test cases grew from approximately 50 to 400, the scenarios became fully detailed, and automation coverage of regression approached 100%. Regression time itself was cut from approximately one day to tens of minutes, and the path from development completion to QA approval from several days to several hours. Bringing testing onto a new project now takes about four hours of setup instead of months of hiring and onboarding.

Additionally, the system began finding more hidden contradictions in requirements and more bugs than the manual process did. Meanwhile, the pilot version runs on a Claude Pro subscription for $100 per month and, as claimed, is capable of serving 2–3 projects with over 100 tests per month each. The main takeaway from this case is that the QA role is indeed beginning to shift from manual execution to managing context, rules, and the quality of the decisions that AI makes.

But the story only works under several conditions: requirements must be sufficiently complete, the project must have proper sources of context such as API contracts and test data, and the pipeline itself must not be a "black box." The value here is not in magical test generation, but in a transparent chain of steps that can be re-run, checked, and gradually strengthened. If this approach takes off beyond pilots, the testing market could get not a replacement for engineers, but a much more productive and scalable way for them to work.

Hamidun News

AI news without noise. Daily editorial selection from 50+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
