ecom.tech compared evolutionary fine-tuning of Qwen3-4B with SFT and GRPO for Kotlin tests
AI-processed from Habr AI; edited by Hamidun News
The ecom.tech team tested whether a small Qwen3-4B-Instruct model could be made to write useful unit tests for Kotlin backends not through standard supervised fine-tuning, but through an evolutionary algorithm called Evolution Strategies. The practical result was strong: in the test generation task, this approach outperformed both supervised fine-tuning and GRPO in terms of final reward and coverage. But alongside the win in specialization, the researchers saw the flip side: the better the model was tuned for a narrow task, the more noticeably it lost some of its general capabilities.
The motivation for the experiment was quite practical. Within their code generation service, the team faced a typical problem: LLMs first produce working code, then write tests for it that look plausible but don't follow internal conventions and don't always check truly important business logic. To assess whether this could be fixed through fine-tuning, the researchers assembled a dataset of 1,500 examples: 1,300 for training and 200 for testing. The model received not only the class being tested but also full context around it, collected by an agent based on qwen-code, and had to output a ready-to-use unit test file.
They used two metrics for evaluation. The first was Coverage, meant not in the familiar sense of line coverage but as functional coverage: how well the generated test covers the same public functionality as the reference test. The second was CodeBLEU, a metric that looks not just at token matches but also at syntax and data flow in the code. Since standard CodeBLEU doesn't support Kotlin, the team added that support themselves using tree-sitter-kotlin and a custom set of keywords.
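The article does not say how functional coverage was computed. One possible reading, sketched below purely as an assumption, is the share of public members exercised by the reference test that the generated test also exercises. The function name and the member names in the usage line are hypothetical.

```python
def functional_coverage(reference_members, generated_members):
    """Share of reference-covered public members that the generated test also covers.

    This is an illustrative definition, not the team's actual metric.
    """
    reference = set(reference_members)
    if not reference:
        # Convention chosen for this sketch: nothing to cover counts as full coverage.
        return 1.0
    return len(reference & set(generated_members)) / len(reference)


# Hypothetical member names: the reference test touches four public methods,
# the generated test touches two of them plus one unrelated method.
score = functional_coverage(
    {"create", "cancel", "refund", "status"},
    {"create", "status", "toString"},
)
```

Under this reading, calls to members the reference test never touches do not raise the score, which matches the idea of comparing against the same public functionality.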
The reward function was simple: CodeBLEU got a weight of 0.6 and Coverage a weight of 0.4, so that the reward reflected both the form of the code and its practical utility.
Evolution Strategies worked as follows in this experiment. Instead of gradient-based updates, the team took about 30 perturbed copies of the base model, created by adding Gaussian noise to the weights, had each copy generate an answer in deterministic mode, and scored that answer with the reward. The base weights were then shifted toward the perturbations that produced the best results. This approach is simpler to parallelize, requires no heavy gradient storage, and, according to the authors, is less prone to reward hacking.
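The reward weighting and the perturb, score, and shift loop can be sketched on a toy problem. This is a minimal illustration, not the team's code: the parameter vector stands in for model weights, `toy_reward` is a hypothetical stand-in for generating a test and scoring it, and `sigma`, `lr`, and the reward standardization are assumptions (the article only gives the 0.6/0.4 weights and "about 30" copies).

```python
import random
import statistics


def combined_reward(codebleu, coverage):
    """Reward weighting from the article: 0.6 * CodeBLEU + 0.4 * Coverage."""
    return 0.6 * codebleu + 0.4 * coverage


def es_step(theta, reward_fn, rng, population=30, sigma=0.1, lr=0.05):
    """One Evolution Strategies update on a flat list of parameters."""
    dim = len(theta)
    # One Gaussian noise vector per perturbed copy of the parameters.
    noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(population)]
    # Score each perturbed copy with a deterministic reward function.
    rewards = [
        reward_fn([t + sigma * e for t, e in zip(theta, eps)]) for eps in noise
    ]
    spread = statistics.pstdev(rewards)
    if spread == 0.0:
        return list(theta)  # every copy scored the same, so there is no direction
    mean = statistics.fmean(rewards)
    # Standardize rewards: better-than-average copies get positive weight.
    weights = [(r - mean) / spread for r in rewards]
    # Shift the base parameters toward the perturbations that scored best.
    return [
        t + lr * sum(w * eps[i] for w, eps in zip(weights, noise)) / population
        for i, t in enumerate(theta)
    ]


# Toy usage: the reward is highest when theta matches a hidden target vector.
rng = random.Random(0)
target = [1.0] * 5


def toy_reward(theta):
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


theta = [0.0] * 5
for _ in range(200):
    theta = es_step(theta, toy_reward, rng)
```

No gradients are computed anywhere: each step needs only forward evaluations of the reward, which is why the population can be scored in parallel.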
They used the open-source Evolution Strategies at Scale project with vLLM acceleration and trained the model on a cluster of 8 H100s. Because a full pass over the dataset at every iteration was too expensive, they introduced batching: each iteration was scored on a randomly selected batch of 32 examples.
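The batching step could look roughly like the sketch below. The sizes come from the article (1,300 training examples, 32 per batch); sampling without replacement and averaging the per-example rewards are assumptions, and `score_example` is a hypothetical callback standing in for generation plus reward computation.

```python
import random

TRAIN_SIZE = 1300  # training examples, from the article
BATCH_SIZE = 32    # examples scored per iteration, from the article


def sample_batch(rng, train_size=TRAIN_SIZE, batch_size=BATCH_SIZE):
    """Pick distinct example indices for one iteration (no replacement: an assumption)."""
    return rng.sample(range(train_size), batch_size)


def batch_reward(score_example, candidate, batch):
    """Average per-example reward of one perturbed candidate over the batch."""
    return sum(score_example(candidate, index) for index in batch) / len(batch)


batch = sample_batch(random.Random(0))
```

Scoring every perturbed copy on the same batch within an iteration keeps their rewards comparable; the article does not state whether the team did this.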
The experiment showed noticeable improvement after just 500 iterations. By the end of training, CodeBLEU had increased by 21.3% relative to the base model, and Coverage by 18.6%. The best ES result reached a Coverage of 0.7381 and the highest final reward; on the selected metrics, it outperformed not only SFT and GRPO but even the much larger Qwen3-Coder-480B.
The picture with competing methods was telling: SFT produced syntactically clean tests but struggled to hit the needed logic, while GRPO actually degraded on both metrics in this setup.
For a narrow engineering task, the conclusion looks straightforward: evolutionary fine-tuning can indeed be a working tool even for a relatively small model. But then came the less pleasant part.
Against the backdrop of recent work on catastrophic forgetting, the team separately checked what happens to the general knowledge of the fine-tuned model. They ran the ES version of Qwen3-4B-Instruct through GPQA—a challenging scientific benchmark. The drop in accuracy averaged 2.1% in zero-shot and 5.3% in five-shot chain-of-thought. The ability to use contextual hints was particularly affected: the benefit from few-shot examples dropped by 41–72%.
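The article does not spell out how the drop in few-shot benefit was calculated. One plausible reading, shown here as an assumption, treats the benefit as the accuracy gain from adding few-shot examples and compares that gain before and after fine-tuning. All accuracy values below are hypothetical placeholders, not figures from the experiment.

```python
def few_shot_benefit(zero_shot_acc, few_shot_acc):
    """Accuracy gained by adding few-shot examples to the prompt."""
    return few_shot_acc - zero_shot_acc


def relative_benefit_drop(base_benefit, tuned_benefit):
    """Fraction of the base model's few-shot benefit lost after fine-tuning."""
    if base_benefit <= 0:
        raise ValueError("base model must benefit from few-shot examples")
    return 1.0 - tuned_benefit / base_benefit


# Hypothetical accuracies, chosen only to make the arithmetic easy to follow.
base = few_shot_benefit(zero_shot_acc=0.30, few_shot_acc=0.40)
tuned = few_shot_benefit(zero_shot_acc=0.28, few_shot_acc=0.33)
drop = relative_benefit_drop(base, tuned)
```

With these placeholder values the fine-tuned model keeps about half of the base model's few-shot gain, so the relative drop is about one half.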
The proposed explanation matches what other research shows: ES makes dense changes to almost all model weights, which helps it solve the target task better but also pushes it further from the base model and makes it forget some prior skills.
What does this mean in practice? Evolution Strategies looks less like a universal replacement for RL and more like a powerful specialized tool for companies that care most about squeezing maximum performance out of a model for one specific pipeline. If there's a clear reward function, sufficient computational resources, and tolerance for a trade-off in general capabilities, ES can already deliver meaningful gains.
But for product teams, it's also a reminder: improving quality on one task isn't free, and the battle ahead will be not just about new metrics, but about ways to fine-tune models without losing their baseline flexibility.