
Can LLMs find flaky tests from code? Study says no

The study tested whether LLMs can spot flaky tests (tests that fail for no clear reason) from source code alone. The result was disappointing: strong metrics on a curated dataset do not mean the model understands flaky behavior in real projects.


Flaky tests are tests that sometimes fail, sometimes pass, without apparent reason. They break CI/CD, force rework, and undermine confidence in automated tests. Researchers decided to entrust this problem to LLMs: can a neural network understand the code and find suspicious tests? The results were disappointing.

Why flaky tests are worse than ordinary bugs

An unreliable test is not just a bug. When a test fails at random moments, engineers stop trusting it. They redo work, restart the pipeline, spend hours debugging. A classic bug can be reproduced; a flaky test only reproduces on Monday at 3:43 AM. This kills development speed.

Sources of flaky tests are diverse and often hidden:

  • Race conditions and timing issues
  • Dependencies on database or file system state
  • Poorly isolated tests affecting each other
  • Asynchronous code without proper waits
  • Hard-coded timeouts that don't tolerate a slow or overloaded environment
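
To make these failure modes concrete, here is a minimal sketch of such a test. It assumes JUnit 5; `OrderServiceTest` and the `FakeMailbox` helper are hypothetical and exist only for the illustration. The test combines two items from the list above: asynchronous code without a proper wait and a hard-coded timeout.

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.junit.jupiter.api.Test;

class OrderServiceTest {

    // Flaky: the assertion races against the background task. On a fast machine
    // 50 ms is enough; on an overloaded CI agent it often is not.
    @Test
    void sendsConfirmationEmail() throws Exception {
        FakeMailbox mailbox = new FakeMailbox();
        CompletableFuture.runAsync(() -> mailbox.deliver("order confirmed"));

        TimeUnit.MILLISECONDS.sleep(50);                  // hard-coded wait instead of a real synchronization point
        assertTrue(mailbox.contains("order confirmed"));  // sometimes passes, sometimes fails
    }

    // Minimal in-memory stand-in, used only for this illustration.
    static class FakeMailbox {
        private final List<String> messages = Collections.synchronizedList(new ArrayList<>());
        void deliver(String msg) { messages.add(msg); }
        boolean contains(String msg) { return messages.contains(msg); }
    }
}
```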

How researchers tested LLMs

A team took several LLMs and asked them a simple question: is this test code flaky? The models looked at the source code, tried to identify suspicious patterns (retry logic, sleep calls, poor isolation), and output a probability that the test was flaky. On a controlled dataset the results looked excellent: the models achieved 85%+ accuracy, precision and recall were good, and the graphs looked like those of a typical successful ML project. It seemed the problem was solved.
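
In outline, that setup can be sketched as follows. The `FlakinessClassifier` interface below is a hypothetical stand-in for whatever model the study actually used; the point is only the shape of the evaluation: score each test, threshold the probability, compare against known labels.

```java
import java.util.Map;

// Hypothetical stand-in for the model under test: takes a test's source code
// and returns a probability that the test is flaky.
interface FlakinessClassifier {
    double flakyProbability(String testSource);
}

class Evaluation {
    /** Precision and recall of the classifier on a labeled dataset (test source -> is it flaky?). */
    static void evaluate(FlakinessClassifier clf, Map<String, Boolean> labeledTests, double threshold) {
        int tp = 0, fp = 0, fn = 0;
        for (Map.Entry<String, Boolean> e : labeledTests.entrySet()) {
            boolean predictedFlaky = clf.flakyProbability(e.getKey()) >= threshold;
            boolean actuallyFlaky  = e.getValue();
            if (predictedFlaky && actuallyFlaky)        tp++;
            else if (predictedFlaky && !actuallyFlaky)  fp++;
            else if (!predictedFlaky && actuallyFlaky)  fn++;
        }
        double precision = tp + fp == 0 ? 0 : (double) tp / (tp + fp);
        double recall    = tp + fn == 0 ? 0 : (double) tp / (tp + fn);
        System.out.printf("precision=%.2f recall=%.2f%n", precision, recall);
    }
}
```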

But here's the paradox: when researchers applied the same models to real tests from other projects, the effect nearly vanished. Accuracy dropped and false positives increased. The models clearly hadn't understood the nature of flaky behavior.

Why metrics don't equal understanding

This is a classic machine learning trap, easy to forget somewhere between reading articles about new models and doing real work. A model can learn correlations in a dataset, but that doesn't mean it has understood the cause. For example, if every flaky test in the training dataset contained `Thread.sleep()`, the model will flag any test with that call as suspicious, even when the sleep is used correctly for synchronization.
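
The trap is easy to reproduce with a deliberately naive detector. The sketch below (illustrative only, not the study's model) flags any test that merely contains suspicious tokens; on a dataset where those tokens happen to co-occur with flakiness it scores well, yet it has learned nothing about why the tests actually fail.

```java
// Deliberately naive: flags a test as flaky if it merely contains tokens that
// correlated with flakiness in the training data. High dataset accuracy is
// possible without any understanding of why the tests fail.
class KeywordFlakinessDetector {
    private static final String[] SUSPICIOUS_TOKENS = {
            "Thread.sleep", "retry", "Random", "System.currentTimeMillis"
    };

    boolean looksFlaky(String testSource) {
        for (String token : SUSPICIOUS_TOKENS) {
            if (testSource.contains(token)) {
                return true; // correlation, not causation: a correct, bounded
                             // sleep in a polling loop gets flagged too
            }
        }
        return false;
    }
}
```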

For flaky tests, the problem is acute: each project has its own failure patterns. What breaks in a microservices architecture may be completely normal in a single-threaded application. The models were trained on one slice of data and don't see environmental context, framework versions, or infrastructure specifics.

Good metrics on a test set are necessary but not sufficient: you need real validation on production examples.

What this means

LLMs are powerful tools, but they're not magic. For specialized problems like finding flaky tests, you need either more context (failure history, environment metadata, load information) or a hybrid approach (LLM + static analysis + monitoring of failures in production). The moral is simple: don't rely on metrics alone; specialized tasks require a specialized approach.
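
A hybrid pipeline of the kind described above might look roughly like this sketch (Java 16+ for the record syntax). The signal names, weights, and thresholds are all assumptions made up for the illustration, not a known implementation.

```java
// Illustrative only: combine several independent signals instead of trusting
// the LLM score alone. Weights and thresholds are invented for the sketch.
class HybridFlakinessScorer {

    record Signals(double llmScore,        // probability from an LLM reading the test code
                   int staticFindings,     // e.g. unawaited async calls, shared mutable state
                   int recentCiFailures,   // CI failures with no related code change
                   boolean failsOnlyUnderLoad) {}

    double score(Signals s) {
        double score = 0.35 * s.llmScore();
        score += 0.25 * Math.min(1.0, s.staticFindings() / 3.0);
        score += 0.30 * Math.min(1.0, s.recentCiFailures() / 5.0);
        if (s.failsOnlyUnderLoad()) score += 0.10;
        return Math.min(1.0, score);
    }

    boolean shouldQuarantine(Signals s) {
        return score(s) >= 0.6; // move the test to a quarantine suite for observation
    }
}
```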
