Wildberries showed how a local AI agent started writing unit tests for Android
AI-processed from Habr AI; edited by Hamidun News
Wildberries & Russ published a practical breakdown of how an Android team lead turned a local LLM from a curiosity into a working assistant for writing unit tests. The experiment took about two months: at first the model consistently made errors and hallucinated, but after reconfiguration it began producing compilable tests and even fixed issues raised in code review on its own.
From Skepticism to Task
The author describes a path familiar to many developers: from the initial euphoria around ChatGPT in 2023 to burnout from the hype and doubts about whether neural networks actually deliver productivity gains in large-scale product development. After reviewing talks, articles, and case studies, he came to a more grounded conclusion: in existing codebases, the discussion is typically not about replacing a team, but about efficiency gains of roughly 10–20% if the tool is properly integrated into the workflow.
"The difference between a junior and a senior in the field of neural network usage is 3–4 months."
With that in mind, the author chose not an abstract goal of "implement AI" but a specific and widely disliked task: writing unit tests for an Android project in Kotlin. One important constraint applied: only a local LLM or a model hosted in the corporate environment could be used. Security requirements ruled out cloud scenarios in which project code is indexed and sent to external servers, so the bet was placed on tools compatible with local infrastructure.
Why It Failed
The first attempt looked logical but flopped. The developer assembled a large system prompt, described the project in detail, and added a markdown file with instructions for the agent plus a template prompt for tests, assuming that maximum context would yield maximum quality. In practice, the local model took about 20 minutes to write a test for a small mapper, and the result was not a working test but a collection of warnings, import errors, and non-compilable code. The larger the class, the worse the hallucinations became and the less the model understood which testing tools it should even use. The main problems of the first iteration were:
- overly detailed instructions overloaded the model and it ignored important requirements;
- Gradle output after build and test runs clogged the context and degraded subsequent responses;
- running multiple sessions in parallel didn't accelerate work, but simply multiplied the number of broken tests;
- attempts to immediately test large ViewModels ended in loss of focus and fictional solutions.
Manual refinement of such results didn't save the situation either. Fixing a generated test often took longer than writing it from scratch. Ultimately, the first phase yielded a useful but unpleasant conclusion: an LLM by itself doesn't become a reliable assistant simply because it was given a long prompt and access to code. What was needed was a different context structure, noise filtering, and clearer division of work within the agent system.
What Actually Worked
In the second iteration, the author changed not just the model but the approach itself. He added RAG with a project index to speed up finding relevant files, and configured Gradle output processing so that only useful information entered the prompt, not all the technical build noise. After that, on top of OpenCode, a structure of a main agent, sub-agents, and separate skills was built.
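The Gradle noise filtering described above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the function name `filter_gradle_output`, the `KEEP_PATTERNS` list, and the line limit are assumptions, and the patterns only approximate typical Kotlin compiler and Gradle test output.

```python
import re

# Assumed patterns (not from the article): Kotlin compiler errors,
# failed test cases, and Gradle's summary lines. Tune for a real project.
KEEP_PATTERNS = [
    re.compile(r"^e: "),                        # Kotlin compiler errors
    re.compile(r"^.+ > .+ FAILED$"),            # "SomeTest > someCase FAILED"
    re.compile(r"^FAILURE: "),                  # Gradle failure header
    re.compile(r"^BUILD (SUCCESSFUL|FAILED)"),  # final build status
    re.compile(r"^\d+ tests completed"),        # test run summary
]


def filter_gradle_output(raw: str, max_lines: int = 40) -> str:
    """Keep only lines likely to help the model; drop duplicates and noise."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if not any(p.search(stripped) for p in KEEP_PATTERNS):
            continue  # task progress, downloads, stack frames, etc.
        if stripped not in kept:
            kept.append(stripped)
    # Cap the size so a very broken build cannot flood the prompt.
    return "\n".join(kept[:max_lines])
```

Only the filtered text would then be appended to the agent's context, so a failed build contributes a handful of lines instead of the full log.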
The idea is simple: instead of one giant instruction, the model receives short rules for each specific work stage. The test generation system was divided into three roles: a planner analyzed the class and collected todos with test cases, a test writer implemented the checks themselves, and a reviewer verified compilation, coverage completeness, and style compliance. It was only after such decomposition that the model produced a ready-to-use compilable test for a small mapper for the first time.
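The three-role decomposition can be sketched as a small pipeline. Everything here is hypothetical: the rule texts in `STAGE_RULES`, the `generate_test` function, and the retry loop that feeds reviewer issues back to the writer are assumptions for illustration, and the model is represented by a plain callable rather than any real OpenCode API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical short per-stage rules; the article does not publish the real ones.
STAGE_RULES = {
    "planner": "List test cases for the class as a todo list. Do not write code.",
    "writer": "Implement the listed test cases in Kotlin. Output only the test file.",
    "reviewer": "Check compilation, coverage of the plan, and style. "
                "Reply OK or list the issues.",
}

Model = Callable[[str], str]  # prompt in, completion out


@dataclass
class GenerationResult:
    plan: str
    test_code: str
    review: str
    attempts: int


def build_prompt(stage: str, payload: str) -> str:
    """Each stage sees only its own short rules plus the data it needs."""
    return f"{STAGE_RULES[stage]}\n\n{payload}"


def generate_test(source: str, model: Model, max_attempts: int = 3) -> GenerationResult:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    plan = model(build_prompt("planner", source))
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        writer_input = f"Class:\n{source}\n\nPlan:\n{plan}"
        if feedback:
            # Assumed retry step: pass the reviewer's issues back to the writer.
            writer_input += f"\n\nReviewer issues to fix:\n{feedback}"
        test_code = model(build_prompt("writer", writer_input))
        review = model(build_prompt("reviewer", f"Plan:\n{plan}\n\nTest:\n{test_code}"))
        if review.strip() == "OK":
            break
        feedback = review
    return GenerationResult(plan, test_code, review, attempt)
```

The design point is that no single prompt carries the whole instruction set: the planner never sees style rules, and the reviewer never sees the full project description.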
The merge request passed review with a style comment; that comment was added to the corresponding skill, and the model made the fix itself. On local hardware, one test still took about 19 minutes, but this was already useful in a real working rhythm: the agent could write tests while the developer was on a call. Later, after a move to more powerful infrastructure, the time dropped to roughly 5–10 minutes per test.
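Folding a review comment back into a skill could look something like the sketch below. It assumes a skill is a plain-text list of rules, which the article does not confirm; the function name `add_review_rule` and the example rule texts are hypothetical.

```python
def add_review_rule(skill_text: str, comment: str) -> str:
    """Append a review comment to a skill's rule list, skipping duplicates."""
    rule = f"- {comment.strip()}"
    lines = skill_text.rstrip("\n").splitlines()
    if rule in lines:
        return skill_text  # already recorded; keep the skill short
    return "\n".join(lines + [rule]) + "\n"


# Usage with made-up rule texts:
skill = "Style rules for generated tests:\n- Use descriptive test names.\n"
skill = add_review_rule(skill, "Name the object under test `sut`.")
```

Keeping the update idempotent matters here because the same comment may come up in several reviews, and the point of skills is to stay short.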
What This Means
The Wildberries case demonstrates that a corporate AI agent starts delivering value not after "magically" connecting a model, but after engineering work on its configuration: a project index, noise filtering, roles, and mandatory result verification. Full replacement of developers is far off, but routine tasks like unit tests or small fixes can already be delegated to the machine, especially where keeping code local and under control is important.