
Habr AI and Spar: how to test ML systems when data drifts and breaks predictions


AI-processed from Habr AI; edited by Hamidun News

Habr AI published a practical guide to testing ML systems, based not on theory but on a live auto-ordering project for the Spar retail chain. The author shows that such products break not only in the models themselves: errors also hide in the data, seasonality, integrations, and even the choice of metrics.

Why This Is Difficult

In classical QA, you can take requirements, prepare test cases, and compare the actual result with the expected one. In ML, this approach only partially works. The model doesn't output a "correct answer" by a rigid rule; it builds a probabilistic forecast.

So the tester checks not for a specific number, but for a range of acceptable error, robustness across different data slices, and the impact of a miss on the business. The complexity is amplified by the fact that the object being tested is not just code. If a model was trained on incomplete, dirty, or outdated data, a good algorithm will still produce poor results.
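A range-based check of this kind might look like the sketch below. It assumes WAPE (weighted absolute percentage error) as the metric and a 10% ceiling; the article names neither, and the sales figures are invented for illustration.

```python
def wape(actual, forecast):
    """Weighted absolute percentage error: sum|a - f| / sum|a|."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    total = sum(abs(a) for a in actual)
    if total == 0:
        raise ValueError("actual sales sum to zero; WAPE is undefined")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / total


def within_tolerance(actual, forecast, max_wape):
    """Pass when the forecast error stays inside the agreed range."""
    return wape(actual, forecast) <= max_wape


# Hypothetical week of daily unit sales and the matching forecast.
actual = [100, 120, 90, 110, 130, 200, 180]
forecast = [95, 125, 100, 105, 120, 210, 170]

print(f"WAPE: {wape(actual, forecast):.4f}")
print("within 10%:", within_tolerance(actual, forecast, max_wape=0.10))
```

The test passes or fails on the agreed ceiling rather than on an exact number, so a forecast that is off by a few units per day is still accepted.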

For retail, this is especially critical: demand changes due to weather, holidays, local events, and new customer habits. A model that forecast accurately yesterday can fail systematically tomorrow because of data drift and shifts in actual customer behavior.
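One common way to put a number on drift is the population stability index (PSI), which compares how a feature is distributed in a reference period and in a recent one. The sketch below is an illustration only: the article does not say which drift measure the team uses, and the bin count and sample data are assumptions.

```python
import math


def psi(reference, current, n_bins=5):
    """Population stability index between two samples of one feature.

    Bins are equal-width over the reference range; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / n_bins

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # A small floor keeps log() defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical daily sales: a reference period and a shifted recent one.
reference = list(range(100))
shifted = [v + 40 for v in reference]

print("no drift:", psi(reference, reference))
print("shifted: ", round(psi(reference, shifted), 2))
```

An identical distribution gives a PSI of zero, and the value grows as the recent sample moves away from the reference. The alert threshold has to be chosen per feature and is not something the source specifies.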

How They Build Control

In the Spar case, the team moved away from the idea of "finding one right answer" and relied on technical and business metrics. At the requirements stage, they first agree on what constitutes acceptable quality: for example, how much a forecast by category can deviate from actual results without real damage to sales and write-offs. Next, tests are built around ranges rather than binary pass/fail. In parallel, they check not only normal scenarios but also anomalous data that shouldn't break the forecast. In practice, control is assembled from several layers:

  • fixed library versions and containerization via Docker;
  • data anonymization to use realistic sales without leaking personal information;
  • targeted testing across different stores, formats, and product categories, not just average metrics;
  • regression testing of the new model against the old one so that improvement in one metric doesn't break others;
  • monitoring of infrastructure and nightly data exchanges, because the business needs not just accurate but timely forecasts.
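The regression layer from the list above can be sketched as a gate that compares every tracked metric of the candidate model with the current one. The metric names, values, and per-metric tolerances below are hypothetical; all metrics are treated as errors, where lower is better.

```python
def regression_failures(old_metrics, new_metrics, max_degradation=None):
    """Return metrics where the new model is worse than allowed.

    max_degradation maps a metric name to the relative worsening that is
    still acceptable (0.05 means up to 5% worse); the default is 0.
    """
    max_degradation = max_degradation or {}
    failures = {}
    for name, old_value in old_metrics.items():
        new_value = new_metrics[name]
        allowed = old_value * (1 + max_degradation.get(name, 0.0))
        if new_value > allowed:
            failures[name] = (old_value, new_value)
    return failures


# Hypothetical evaluation results for the same test period.
old = {"wape": 0.120, "abs_bias": 0.030, "stockout_rate": 0.040}
new = {"wape": 0.100, "abs_bias": 0.031, "stockout_rate": 0.055}

failures = regression_failures(old, new, max_degradation={"abs_bias": 0.05})
print(failures)
```

Here the candidate improves WAPE and stays within the allowed bias drift, yet the gate still blocks it because the stockout rate got worse.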

A separate conclusion from the article is that testing ML only on aggregate averages (the "average temperature across the hospital", as the Russian idiom goes) is useless. A model can look good on chocolate as a whole but fail on a specific brand, or forecast bread accurately while being wrong on sauces. So the testing goes deeper: by category, by levels of detail, and by a representative sample of stores. This approach costs more, but it gives a real picture before release rather than after complaints from the business.
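The gap between an aggregate metric and per-slice metrics is easy to show on a toy example. The sketch below assumes WAPE as the error measure and uses invented rows; the field names are hypothetical.

```python
from collections import defaultdict


def wape_by_slice(rows, key):
    """WAPE computed separately for each value of rows[i][key]."""
    abs_err = defaultdict(float)
    total = defaultdict(float)
    for row in rows:
        k = row[key]
        abs_err[k] += abs(row["actual"] - row["forecast"])
        total[k] += abs(row["actual"])
    # Slices with zero actual sales are skipped: WAPE is undefined there.
    return {k: abs_err[k] / total[k] for k in total if total[k] > 0}


# Hypothetical rows: a high-volume category and a low-volume one.
rows = [
    {"category": "bread", "actual": 500, "forecast": 490},
    {"category": "bread", "actual": 480, "forecast": 495},
    {"category": "sauces", "actual": 20, "forecast": 45},
    {"category": "sauces", "actual": 25, "forecast": 5},
]

overall = sum(abs(r["actual"] - r["forecast"]) for r in rows) / sum(
    r["actual"] for r in rows
)
per_category = wape_by_slice(rows, "category")

print(f"overall WAPE: {overall:.3f}")
for name, value in per_category.items():
    print(f"{name}: {value:.3f}")
```

The overall error stays under 7% because bread dominates the volume, while the error on sauces is 100%. The same function can be run with a store or store-format key to cover the other slices the article mentions.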

Production Failures

The most instructive part of the material is the real outages. In one case, the team confused two nearly identical parameters of a seasonal algorithm: prediction_share and predict_share. That was enough for the system to dramatically overestimate the forecast for butter, sour cream, and cottage cheese.
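The article does not describe how such a jump could have been caught before orders went out. One possible guard, sketched here under that caveat, compares each forecast with recent average sales and flags large ratios for review. The SKU names, numbers, and ratio bounds are all hypothetical.

```python
from statistics import mean


def flag_forecast_jumps(history, forecast, max_ratio=2.0, min_ratio=0.5):
    """Flag SKUs whose forecast is far from their recent average sales."""
    flagged = {}
    for sku, recent_sales in history.items():
        baseline = mean(recent_sales)
        if baseline <= 0:
            continue  # no meaningful baseline to compare against
        ratio = forecast[sku] / baseline
        if ratio > max_ratio or ratio < min_ratio:
            flagged[sku] = round(ratio, 2)
    return flagged


# Hypothetical recent daily sales and next-day forecasts.
history = {
    "butter": [40, 42, 38, 41],
    "sour_cream": [60, 58, 61, 59],
    "bread": [200, 210, 190, 205],
}
forecast = {"butter": 120, "sour_cream": 150, "bread": 207}

flagged = flag_forecast_jumps(history, forecast)
print(flagged)
```

A check like this does not explain why a forecast jumped, and it will also fire on legitimate peaks such as holidays, so it is better suited to triggering a manual review than to blocking orders outright.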

Excess dairy products arrived at stores, and part of the inventory had to be quickly discounted because of short shelf life. The error was small at the code level but expensive at the operational level.

There was also the opposite case: an underestimate for lavash after a release. Weekly seasonality started being calculated incorrectly, and the demand peak "moved" from weekends to the middle of the week. Because sales volumes were low, the problem wasn't noticed immediately, but for customers the effect was simple: the product disappeared from shelves exactly when they needed it.

Another failure happened at the beginning of 2025: the model misinterpreted the year field and essentially "didn't understand" that a new year had arrived, and the anomaly detection system didn't catch it.
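The source gives no detail on how the year field was misread, so the following is only a sketch of the general idea: run the calendar feature code over dates that cross a year boundary and check that the features stay consistent. Both feature builders are hypothetical; the second one contains a deliberately planted bug.

```python
from datetime import date, timedelta


def build_time_features(day):
    """Hypothetical feature builder: calendar features for one date."""
    return {
        "year": day.year,
        "day_of_year": day.timetuple().tm_yday,
        "weekday": day.weekday(),  # Monday == 0
    }


def buggy_time_features(day):
    """Same builder with a planted bug: the year is hard-coded."""
    features = build_time_features(day)
    features["year"] = 2024
    return features


def check_year_rollover(builder, start, n_days):
    """Walk forward day by day and collect inconsistent feature values."""
    problems = []
    prev = builder(start)
    for offset in range(1, n_days + 1):
        day = start + timedelta(days=offset)
        cur = builder(day)
        if cur["weekday"] != (prev["weekday"] + 1) % 7:
            problems.append((day, "weekday sequence is broken"))
        if cur["day_of_year"] == 1:
            # The day counter restarted, so the year must advance by one.
            if cur["year"] != prev["year"] + 1:
                problems.append((day, "year did not advance"))
        elif cur["year"] != prev["year"]:
            problems.append((day, "year changed mid-year"))
        prev = cur
    return problems


start = date(2024, 12, 25)
print(check_year_rollover(build_time_features, start, n_days=14))
print(check_year_rollover(buggy_time_features, start, n_days=14))
```

Running the same check in December against dates in January is cheap, and it exercises code paths that no amount of historical data from the current year can reach.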

The conclusion is harsh: ML must be tested not only on known data but also on future periods, new value ranges, and failures of protection mechanisms.
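Testing a protection mechanism means feeding it anomalies on purpose and confirming that it fires. The detector below is a simple stand-in based on the median absolute deviation, not the system used in the Spar project; the threshold and data are assumptions.

```python
from statistics import median


def is_anomalous(history, value, threshold=5.0):
    """Robust z-score check using the median absolute deviation (MAD)."""
    center = median(history)
    mad = median([abs(x - center) for x in history])
    if mad == 0:
        # No spread in history: anything different from it is suspicious.
        return value != center
    return abs(value - center) / mad > threshold


# Hypothetical daily forecasts for one SKU over the past week.
history = [100, 104, 98, 101, 97, 103, 99]

normal_values = [102, 96]
injected_anomalies = [300, 0, 1000]

missed = [v for v in injected_anomalies if not is_anomalous(history, v)]
false_alarms = [v for v in normal_values if is_anomalous(history, v)]

print("missed anomalies:", missed)
print("false alarms:", false_alarms)
```

A test like this belongs in the regular suite: if a later change makes the detector silent on injected anomalies, the failure shows up in testing rather than in production.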

What This Means

The Habr AI article clearly demonstrates a shift in how QA for ML is understood. Here, running test cases against code isn't enough: you need a combination of metrics, data, monitoring, and business context. For teams deploying forecasting in retail, logistics, or fintech, this is no longer an additional discipline but a mandatory layer of protection against expensive and silent errors.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
