T-Bank improved AI Code Completion quality with a filter and removed unnecessary suggestions
AI-processed from Habr AI; edited by Hamidun News
T-Bank shared how it rebuilt its internal AI Code Completion for 7,500 developers and increased the share of accepted suggestions without changing the core generation model. Instead of another attempt to make completion "smarter," the team added a separate filter that decides whether to show a suggestion at all.
Where Was the Ceiling
T-Bank's code autocomplete service has been running in production for several years and is used daily by almost all internal developers—about 7,500 unique users. The basic metric, Acceptance Rate, long hovered around 20%: roughly every fifth suggestion was accepted, the rest ignored. The team tried lengthening suggestions, changing the generation strategy, and expanding the number of places to show suggestions, but this created more noise.
The more actively the system suggested, the more often developers saw useless continuations and the weaker their trust in the product became. This led to a different hypothesis: the problem might not only be in generation quality, but in the absence of a separate mechanism to determine when exactly a suggestion should be hidden. The team noticed an important behavioral effect: if the noise decreased, users started pressing Tab more willingly and gave more chances to even non-obvious proposals.
By the team's internal estimate, each additional percentage point of Acceptance Rate (AR) added about 2% to the number of acceptances over time. But there was a strict business constraint: the filter must not immediately remove more than 5% of the suggestions that users would have accepted.
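The 5% constraint can be read as a rule for picking the filter's cutoff. Below is a minimal sketch, assuming the filter outputs a score per suggestion and that a suggestion is shown when its score is at or above the threshold. The function names and the toy data are illustrative and are not taken from T-Bank's system.

```python
from typing import List, Tuple


def pick_threshold(scores: List[float], accepted: List[bool],
                   max_accepted_loss: float = 0.05) -> float:
    """Highest threshold that hides at most `max_accepted_loss` of accepted suggestions.

    A suggestion is considered shown when score >= threshold.
    """
    accepted_scores = sorted(s for s, a in zip(scores, accepted) if a)
    if not accepted_scores:
        raise ValueError("need at least one accepted suggestion")
    # How many accepted suggestions the filter is allowed to hide.
    budget = int(max_accepted_loss * len(accepted_scores))
    # Cutting at this score hides only accepted suggestions scored strictly below it,
    # which is at most `budget` of them.
    return accepted_scores[budget]


def evaluate(scores: List[float], accepted: List[bool],
             threshold: float) -> Tuple[float, float]:
    """Return (acceptance rate among shown suggestions, share of accepted suggestions kept)."""
    shown = [a for s, a in zip(scores, accepted) if s >= threshold]
    acceptance_rate = sum(shown) / len(shown) if shown else 0.0
    kept_share = sum(shown) / sum(accepted)
    return acceptance_rate, kept_share


# Toy log: 20 accepted and 80 rejected suggestions, i.e. a 20% baseline AR.
accepted_part = [0.05] + [0.5 + 0.02 * i for i in range(19)]
rejected_part = [i / 100 for i in range(80)]
scores = accepted_part + rejected_part
accepted = [True] * len(accepted_part) + [False] * len(rejected_part)

threshold = pick_threshold(scores, accepted)
baseline_ar, _ = evaluate(scores, accepted, 0.0)
filtered_ar, kept_share = evaluate(scores, accepted, threshold)
print(f"threshold={threshold}, AR {baseline_ar:.3f} -> {filtered_ar:.3f}, kept={kept_share:.2f}")
```

On this toy log the cutoff sacrifices one accepted suggestion out of twenty and hides the low-scored rejected ones, so the acceptance rate among shown suggestions rises. The real distribution of scores will look different, which is why the cutoff has to be fitted on logged traffic.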
How They Built the Filter
The first step was a quick baseline on CatBoost. The model was trained as a binary classifier: accept the suggestion or not. They used only those features that could be calculated in real time without storing request history: IDE, programming language, cursor position, suggestion type, prefix and suffix size.
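A minimal sketch of what such stateless features might look like, assuming the request carries the text before and after the cursor. The class and field names are hypothetical; only the list of feature kinds comes from the description above.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class CompletionRequest:
    # All field names here are illustrative assumptions.
    ide: str
    language: str
    suggestion_type: str  # e.g. "single_line" or "multi_line" (hypothetical values)
    prefix: str           # file text before the cursor
    suffix: str           # file text after the cursor


def stateless_features(req: CompletionRequest) -> Dict[str, Union[str, int]]:
    """Features computable from a single request, with no stored history."""
    prefix_lines = req.prefix.split("\n")
    return {
        "ide": req.ide,
        "language": req.language,
        "suggestion_type": req.suggestion_type,
        "cursor_line": len(prefix_lines) - 1,    # zero-based line index
        "cursor_column": len(prefix_lines[-1]),  # characters before the cursor on its line
        "prefix_chars": len(req.prefix),
        "suffix_chars": len(req.suffix),
    }


example = CompletionRequest(
    ide="vscode",
    language="python",
    suggestion_type="single_line",
    prefix="def f():\n    return ",
    suffix="\n",
)
features = stateless_features(example)
print(features)
```

With CatBoost, the string-valued fields would typically be declared as categorical through the `cat_features` argument, and the label would be whether the suggestion was accepted.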
Even this simple variant gave about +2.3 percentage points to Acceptance Rate offline and confirmed that the task indeed had a strong signal.
Next, the team moved to a text filter based on Qwen2.5-Coder 1.5B. Larger models did not fit within production constraints: the target of 30 requests per second on a single Nvidia A100 and p90 latency no higher than 50 ms.
So they chose a compromise: a model compact enough for inference, but still tailored for code. To prevent the model from confusing file context with the suggestion itself, the input had to be strictly structured, and the model had to be fine-tuned for classification rather than generation.
- Replaced the generation head with binary classification
- Tagged context for prefix, line, answer, and suffix
- Encoded IDE, language, and cursor position with special tokens
- At the final stage, added fine-tuning, LoRA, and focal loss due to class imbalance
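For reference, focal loss down-weights examples the classifier already gets right, so the rarer class is not drowned out. A minimal sketch of the standard binary form follows; the `gamma` and `alpha` defaults are the commonly used ones and are an assumption here, not values reported by the team.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one example.

    p: predicted probability of the positive class ("accept").
    y: true label, 1 for accepted and 0 for rejected.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
    p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of confidently correct examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)  # confident and correct
hard = binary_focal_loss(0.1, 1)  # confident and wrong
print(f"easy={easy:.6f}, hard={hard:.6f}")
```

With `gamma=0` the expression reduces to class-weighted cross-entropy, which makes the effect of the focusing term easy to check.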
This pipeline improved quality in steps: strict structuring raised the offline gain to about +3.9 p.p., special tokens raised it to +5.1 p.p., and full fine-tuning brought it to +6.8 p.p. What mattered most was not fine-tuning alone but how the input was packaged: the model became better at telling the file context apart from the suggestion it was supposed to evaluate rather than rewrite.
What Broke in Production
On synthetic tests everything looked great, but a shadow run quickly dampened expectations. Simply converting the model to ONNX almost tripled throughput and reduced response time to about 30 ms, but on real traffic, peak latency jumped back up to 90 ms. The cause turned out to be not the model itself, but the load profile: in production, bursts of almost simultaneous requests came in, which were not present in tests. The problem was solved through Triton and dynamic batching with a small batch size and short queue wait time.
"Offline is necessary, but shadow running is the only place where reality begins."
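To see why dynamic batching helps with bursts, here is a simplified simulation of the batching rule, assuming a batch closes when it is full or when the next request arrives too long after the batch's first request. It is an illustration of the idea, not Triton's scheduler, and it ignores inference time.

```python
from typing import List


def form_batches(arrivals_ms: List[float], max_batch_size: int,
                 max_wait_ms: float) -> List[List[float]]:
    """Group request arrival times (in ms) into batches.

    A batch is closed when it already holds `max_batch_size` requests or when
    the next request arrives more than `max_wait_ms` after the batch's first one.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in sorted(arrivals_ms):
        batch_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and t - current[0] > max_wait_ms
        if batch_full or waited_too_long:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# A burst of four near-simultaneous requests, then two more 30 ms later.
arrivals = [0.0, 0.5, 1.0, 1.2, 30.0, 30.1]
burst_batches = form_batches(arrivals, max_batch_size=4, max_wait_ms=2.0)
print(burst_batches)
```

Without batching, the four requests in the burst would be served one after another and the last one would wait for three forward passes. With a small batch they share one pass at the cost of a short queue delay. In Triton, the corresponding knobs live in the model configuration's `dynamic_batching` section, for example `max_queue_delay_microseconds`.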
After this, a second layer of problems surfaced: the filter turned out to be too aggressive. To keep the loss of accepted suggestions within 5%, the threshold had to be recalibrated on a weekly window of data rather than a few days. Then another CatBoost model was added on top of the LLM. It received the main model's score, tabular features, and historical signals such as the interval between requests and changes in prefix length.
For this, user state was stored in Redis. Along the way, the team caught a typical engineering mistake: part of the features in production were calculated in bytes, and part in characters. After aligning the logic, an A/B test showed 4.7% dropped traffic and +5.2 p.p. to Acceptance Rate without skew by language and IDE.
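The bytes-versus-characters mismatch is easy to reproduce. A small sketch, assuming UTF-8 encoded source text; the helper names and sample strings are illustrative:

```python
def length_in_chars(text: str) -> int:
    """Length as the number of Unicode code points."""
    return len(text)


def length_in_bytes(text: str) -> int:
    """Length of the UTF-8 encoding of the same text."""
    return len(text.encode("utf-8"))


ascii_prefix = "total = price * qty"
cyrillic_prefix = "# сумма заказа"  # a comment in Russian

for sample in (ascii_prefix, cyrillic_prefix):
    print(repr(sample), length_in_chars(sample), length_in_bytes(sample))
```

For pure ASCII code the two lengths coincide, so the bug stays invisible until a file contains non-ASCII comments or string literals. At that point a model trained on one unit and served with the other sees shifted feature values.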
What This Means
T-Bank's case shows well that the next quality improvement in AI tools does not always come from a new large model. Sometimes a separate decision layer that stays quiet at the right time brings greater effect. For products with high usage frequency, this is also a matter of trust: if you remove unnecessary suggestions, users not only get annoyed less often, but over time more often accept useful options. At the scale of thousands of developers, this quickly turns into noticeable time savings.