
VTB explained why AI pilots stall before production and how architecture can fix it


AI-processed from Habr AI; edited by Hamidun News

On April 8–9 at the VTB Data Fusion conference, VTB publicly acknowledged a problem familiar to almost every corporate AI customer: pilots look convincing, but few reach real production. The focus of the analysis is not the quality of an individual model, but the architecture of implementation itself.

Why Pilots Break

The key insight is simple: a pilot typically tests one step under controlled conditions, but production involves an entire chain of actions in which errors accumulate. If each of eight links operates with 85% accuracy and their errors are independent, the reliability of the whole chain drops to about 27% (0.85^8 ≈ 0.27). In a presentation, such a system still looks "almost good," but in a real process roughly three out of four results turn out to be wrong, and the most dangerous part is that it is unclear in advance which ones.
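The arithmetic behind that figure is easy to check. The sketch below assumes that every link succeeds independently and with the same probability, which the quoted number implies but the article does not state; the function name `chain_reliability` is illustrative.

```python
def chain_reliability(step_accuracy: float, steps: int) -> float:
    """Probability that every step of a sequential chain is correct.

    Assumes steps fail independently of each other (an assumption,
    not something the source states explicitly).
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_accuracy ** steps


if __name__ == "__main__":
    # Show how quickly reliability decays as the chain gets longer.
    for n in (1, 2, 4, 8):
        print(f"{n} step(s) at 85% each: {chain_reliability(0.85, n):.3f}")
```

For eight steps at 85% the result is roughly 0.27, which matches the figure above. If errors in real pipelines are correlated, the actual number may differ in either direction.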

That's why the problem manifests not as a one-off bug, but as a systemic degradation of quality when scaling. This also leads to a more unpleasant conclusion: the market often optimizes AI not for accuracy, but for autonomy. The metric "what percentage of tasks are completed without human intervention" is convenient for marketing and reports, but it says little about how well the system stays grounded in reality over the long term.

The article links this to automation bias and deskilling: people increasingly trust incorrect suggestions and simultaneously lose the skill to make decisions without the machine. As a result, the company gets not only a fragile pipeline, but also a gradual erosion of its own expertise.

Architecture with Humans

Instead of full autonomy, a low-entropy scheme is proposed, where humans are built into the system as a mandatory element, not an emergency button. It divides the work into four levels: from a field operator near the object to a domain expert who checks the model's recommendations and feeds corrections back into training.

The logic is to "offload" uncertainty at each level rather than allow it to creep uncontrollably up the chain.

  • Level 0 — an operator or specialist on-site who sees the actual object and validates input data.
  • Level 1 — narrow models for specific signals: temperature, humidity, defects, images, or other physical parameters.
  • Level 2 — a coordinator that collects model results, reasons, and formulates a recommendation for humans.
  • Level 3 — a domain expert who confirms or corrects the conclusion and thereby provides the system with a learning signal.
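The four levels above can be sketched as a small pipeline. This is a minimal illustration, not the architecture from the talk: the class names, the threshold rules standing in for narrow models, and the recommendation format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Observation:
    """Level 0: input data checked by an on-site operator."""
    readings: Dict[str, float]
    validated_by_operator: bool


@dataclass
class Recommendation:
    """Level 2 output that is handed to the domain expert."""
    text: str
    signals: Dict[str, str]


@dataclass
class Correction:
    """Level 3: the expert's verdict, kept as a learning signal."""
    recommendation: Recommendation
    approved: bool
    corrected_text: Optional[str] = None


# Level 1: one narrow model per signal (hypothetical threshold rules).
NARROW_MODELS: Dict[str, Callable[[float], str]] = {
    "temperature": lambda v: "high" if v > 30.0 else "normal",
    "humidity": lambda v: "low" if v < 20.0 else "normal",
}


def coordinate(obs: Observation) -> Recommendation:
    """Level 2: collect narrow-model results and form a recommendation."""
    if not obs.validated_by_operator:
        # Uncertainty is stopped at Level 0 instead of moving up the chain.
        raise ValueError("Level 0 validation missing; refusing to proceed")
    signals = {name: NARROW_MODELS[name](value)
               for name, value in obs.readings.items()
               if name in NARROW_MODELS}
    abnormal = sorted(k for k, v in signals.items() if v != "normal")
    text = ("inspect: " + ", ".join(abnormal)) if abnormal else "no action"
    return Recommendation(text=text, signals=signals)


def expert_review(rec: Recommendation, approved: bool,
                  corrected_text: Optional[str] = None) -> Correction:
    """Level 3: the expert confirms or corrects the recommendation."""
    return Correction(rec, approved, corrected_text)


if __name__ == "__main__":
    obs = Observation({"temperature": 34.0, "humidity": 45.0},
                      validated_by_operator=True)
    rec = coordinate(obs)
    verdict = expert_review(rec, approved=True)
    print(rec.text, verdict.approved)
```

The point of the design is that the coordinator refuses unvalidated input and never acts on its own: every recommendation ends in an expert verdict that can later be used for training.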

In such a design, the task of AI is not to replace the specialist, but to expand their scope of action and productivity. The author provides the example of a digital twin of a forest ecosystem covering more than 180 thousand hectares: as coverage grew from 2 to 50 thousand hectares, capital expenditure increased 2.1 times, operational expenses increased 2.2 times, and the team grew only from four to eight people. With a traditional approach, by the author's estimate, many more field staff would have been required.
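A quick calculation makes the scaling claim concrete. The inputs below are the figures quoted in the text; the per-hectare and per-person ratios are derived here and do not appear in the source.

```python
# Figures quoted in the article.
area_before_ha, area_after_ha = 2_000, 50_000
capex_growth, opex_growth = 2.1, 2.2
team_before, team_after = 4, 8

# Derived ratios (computed here, not taken from the source).
area_growth = area_after_ha / area_before_ha          # 25.0
capex_per_ha_ratio = capex_growth / area_growth       # cost per hectare vs. start
opex_per_ha_ratio = opex_growth / area_growth
ha_per_person_before = area_before_ha / team_before   # 500.0
ha_per_person_after = area_after_ha / team_after      # 6250.0

if __name__ == "__main__":
    print(f"coverage growth: {area_growth:.0f}x")
    print(f"capex per hectare vs. start: {capex_per_ha_ratio:.1%}")
    print(f"opex per hectare vs. start: {opex_per_ha_ratio:.1%}")
    print(f"hectares per person: {ha_per_person_before:.0f} -> "
          f"{ha_per_person_after:.0f}")
```

On these numbers, coverage grows 25 times while capital and operational expenses per hectare fall to under a tenth of their initial level, and each team member covers 12.5 times more area.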

Why API Is Not Enough

A separate point concerns the stack. The article argues that such a scheme is difficult to build on public APIs of large models alone, because domain expertise should live not just in prompts or RAG (retrieval-augmented generation), but in the weights of a locally controlled model. For this, LoRA or QLoRA adapters are proposed, which are fine-tuned on verified pairs of answers and expert preferences. After the working day, logs are validated by humans, fine-tuning runs overnight, and in the morning the system launches with updated domain knowledge.
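The data side of that nightly cycle might look like the sketch below. It only covers turning expert-reviewed logs into training sets; the adapter training itself is left out. The record fields, the verdict labels "approved" and "corrected", and the JSON Lines layout are assumptions made for illustration, not details from the article.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class LogEntry:
    """One logged recommendation plus the expert's end-of-day verdict."""
    prompt: str
    model_answer: str
    expert_verdict: Optional[str]        # "approved", "corrected", or None
    expert_answer: Optional[str] = None  # filled in when corrected


def build_training_sets(
    entries: List[LogEntry],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split reviewed logs into supervised pairs and preference pairs.

    Unreviewed entries are skipped, so only human-validated data
    reaches the overnight fine-tuning job.
    """
    supervised: List[Dict[str, str]] = []
    preferences: List[Dict[str, str]] = []
    for e in entries:
        if e.expert_verdict == "approved":
            supervised.append({"prompt": e.prompt,
                               "completion": e.model_answer})
        elif e.expert_verdict == "corrected" and e.expert_answer:
            supervised.append({"prompt": e.prompt,
                               "completion": e.expert_answer})
            # The expert's answer is preferred over the model's.
            preferences.append({"prompt": e.prompt,
                                "chosen": e.expert_answer,
                                "rejected": e.model_answer})
    return supervised, preferences


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    day_log = [
        LogEntry("Sensor 7 reads 34 C", "inspect cooling", "approved"),
        LogEntry("Sensor 2 reads 12% humidity", "no action",
                 "corrected", "inspect irrigation"),
        LogEntry("Sensor 9 reads 21 C", "no action", None),
    ]
    sft, prefs = build_training_sets(day_log)
    print(to_jsonl(sft))
    print(to_jsonl(prefs))
```

Approved answers become supervised examples, corrected ones yield both a supervised example and a preference pair, and anything the expert did not review stays out of training.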

"A prompt is forgotten at the end of the context window.
An adapter, never."

This logic bets on the company's own infrastructure. The reference hardware budget cited in the material is approximately 900 thousand to 1.2 million rubles: a server with an RTX 4090 or 5090 for the coordinator and overnight training, several Raspberry Pi devices for narrow models on-site, and separate log storage. The main argument is not that cloud models are useless, but that they are better used as an external research tool than as a decision-making layer in critical production loops.

What This Means

For the market, this is an important shift: the question is no longer how many people can be removed from the process, but how to maintain quality while scaling AI. If this logic takes hold, corporate implementations will increasingly be built around local models, continuous verification, and human-machine loops, rather than around promises of full autonomy.
