Sber showed how RAG and LLM in the IDE turn manual scenarios into automated tests
AI-processed from Habr AI; edited by Hamidun News
Sber demonstrated a way to cut down one of the most routine tasks in QA: converting manual test scenarios into automated tests directly in the IDE. The approach combines an LLM with RAG, so the system does not just write code: it first searches the project itself for relevant examples in order to preserve its style, conventions, and architecture.
Why Direct Prompting Doesn't Work
The idea of "just send a manual test to the model and get ready-made code" breaks down in practice over the details. An LLM can indeed assemble a working Java test, but it will almost certainly not write it the way a specific team does: it will forget about Allure, violate the naming scheme, put URLs and request bodies in the wrong place, add unnecessary checks, or fail to use internal utilities such as a shared status verification method. For test automation, this is not a cosmetic issue but a real loss of compatibility with the project.
The problem runs deeper than prompt quality. Automated tests live inside their own framework and are tied to the project's architecture, CI, Page Objects, TMS keys, and annotation conventions. Passing the entire project to the model quickly runs into context window limits and drives up latency and request costs.
Even then, hallucinations, skipped steps, and unstable output remain, so the generated code still has to be refined by hand.
How RAG Works
Instead of static few-shot prompting with hand-picked examples, the Sber team built a plugin for JetBrains IDEs. It scans the project through the PSI (Program Structure Interface) tree rather than as raw text, so it sees classes, methods, annotations, and calls. From this, the system collects Allure steps and existing automated tests together with their TMS keys, code, and brief text descriptions. The descriptions are then converted into embeddings and saved along with the metadata to a local knowledge base or vector store. When a new manual scenario arrives, the system goes through several stages:
- identifies the action and expected result for each step;
- builds several semantic query variants to find similar steps;
- asks the LLM for a brief description of the whole test, which is used to find a semantically similar automated test;
- automatically inserts the retrieved examples into the prompt in place of hand-picked few-shot examples;
- verifies the result with a follow-up request to the model and with PSI checks inside the IDE.
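The retrieval and prompt-assembly stages above can be illustrated with a minimal Python sketch. It is not the Sber plugin's code: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the `IndexedExample` record, the `TMS-…` keys, and the helper names inside the example code strings are all hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class IndexedExample:
    """One knowledge-base entry: a description plus the code and metadata it points to."""
    tms_key: str       # hypothetical TMS identifier
    description: str   # brief text description used for retrieval
    code: str          # existing step or test that will be shown to the LLM


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query_variants: list[str], index: list[IndexedExample], k: int = 2) -> list[IndexedExample]:
    """Score each entry by its best match over all query variants and return the top k."""
    queries = [embed(q) for q in query_variants]
    scored = []
    for example in index:
        vec = embed(example.description)
        best = max((cosine(q, vec) for q in queries), default=0.0)
        scored.append((best, example))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [example for score, example in scored[:k] if score > 0]


def build_prompt(manual_step: str, examples: list[IndexedExample]) -> str:
    """Place the retrieved examples into the prompt instead of hand-picked few-shot."""
    shots = "\n\n".join(f"# {e.tms_key}: {e.description}\n{e.code}" for e in examples)
    return (
        f"Project examples:\n{shots}\n\n"
        f"Manual step:\n{manual_step}\n\n"
        "Write the automated step in the same style."
    )


# Hypothetical index contents; in the described system these come from the project scan.
INDEX = [
    IndexedExample("TMS-101", "send POST request to create user and check status 201",
                   "response = create_user(payload)\ncheck_status(response, 201)"),
    IndexedExample("TMS-102", "open login page and enter credentials",
                   "login_page.open()\nlogin_page.sign_in(user)"),
    IndexedExample("TMS-103", "send GET request for user by id and check status 200",
                   "response = get_user(user_id)\ncheck_status(response, 200)"),
]

variants = ["create a user via POST request", "POST user returns status 201"]
prompt = build_prompt("Create a user and expect 201", retrieve(variants, INDEX))
```

Scoring each entry by its best match across several query variants mirrors the idea of building multiple semantic queries per step: an example is kept if any phrasing of the step resembles it. With a real embedding model only `embed` would change; the ranking and prompt assembly stay the same.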
This approach produces code that looks as if it was written by a member of the team rather than by an external model with no context. The article stresses that semantic search is there not because RAG is fashionable but for a practical purpose: to pull into the generation exactly those steps, utilities, and patterns that have already proven themselves within the project.
In this setup, RAG is few-shot prompting that finds the necessary examples on its own.
Prototype Results
According to the team, the internal prototype has already shown a measurable effect. About 68% of generated tests were of acceptable quality and required only minimal edits, and overall user satisfaction was approximately 80%. The tool performed best on simple, uniform API scenarios, where the main need is to reproduce an established code template quickly without manually copying from adjacent tests.
Automation engineers also noted a reduction in cognitive load: less time spent on routine work, more time for complex scenarios and architectural decisions. But there is no universal "autopilot" here. Complex tests with a large number of steps still tend to be simplified by the model, and for UI scenarios it needs more context, such as information about page objects.
Hallucinations haven't gone away either, so final verification remains mandatory. At the same time, the authors note that they did not see serious language limitations: similar results were obtained not only in Java, but also in Python and Gherkin. By their estimate, such a tool can save an automation engineer more than half the time spent writing new routine tests.
What This Means
For AI tools in development, a more interesting stage is beginning: value is shifting from "generate any code" to "generate code that immediately integrates into a live project." The Sber story suggests that in QA the winners will be not the teams that simply connect an LLM to the IDE, but those that surround the model with their own knowledge base, checks, and engineering rules.