97 Hours on a Single GPU: An Experiment With a Self-Learning Neural Network and the Trap of Impressive Metrics
An independent developer spent 97.5 GPU hours on a single RTX 4090 trying to build an architecture that would let a language model plug in new 'skills' without
AI-processed from Habr AI; edited by Hamidun News
Imagine you take a language model and want to add a new capability to it: say, the ability to solve differential equations or write code in Rust. The standard path is fine-tuning, which requires data and compute and often leads to catastrophic forgetting, where a model loses old abilities while acquiring new ones. But what if skills could be plugged in like apps on a smartphone: quickly, modularly, without side effects? That is the idea an independent researcher decided to test, publishing a detailed report on Habr. The result: 97.5 GPU hours on a single RTX 4090, 22 iterations of experiments, and one of the more instructive disappointments in recent machine learning.
The concept of modular expansion of language models itself is not new. The industry has long discussed approaches like LoRA adapters, mixture of experts, and various plugin architectures. The author's idea went further: to create a system in which a model could not just use external modules, but actually improve itself, integrating new competencies into its work without a full retraining cycle. It sounds like the Holy Grail for those working with limited computational resources — and that's the vast majority of independent researchers and small teams who don't have access to clusters of thousands of GPUs.
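To make the "pluggable skill" idea concrete, here is a minimal sketch of a LoRA-style adapter in plain Python. It is not the author's architecture, which the report summarized here does not specify; the class name `LoRALinear`, the `enabled` flag, and all numbers are illustrative assumptions. It shows only the general property that makes adapters attractive: the base weights stay frozen, so switching the adapter off restores the original behavior exactly.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen base weight W plus an optional low-rank update (alpha / r) * B @ A.

    Hypothetical illustration class, not taken from the original report.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen base weights, never modified
        self.scale = alpha / rank
        # A starts small and random, B starts at zero,
        # so a freshly attached adapter changes nothing.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.enabled = True

    def forward(self, x):
        y = matvec(self.weight, x)
        if self.enabled:
            delta = matvec(self.B, matvec(self.A, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y


W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
layer = LoRALinear(W, rank=1)
x = [1.0, 1.0, 1.0]

base_out = layer.forward(x)        # B is zero, so this equals W @ x

# Pretend training has updated the adapter matrices (made-up values).
layer.A = [[0.1, 0.2, 0.3]]
layer.B = [[0.5], [-0.5]]
adapted_out = layer.forward(x)     # output now differs from the base model

layer.enabled = False              # "unplug" the skill
restored_out = layer.forward(x)    # identical to base_out again

print(base_out, adapted_out, restored_out)
```

Only `A` and `B` would be trained in such a setup, which is why adapters of this kind are cheap enough for a single consumer GPU. Note that this sketch says nothing about whether the added "skill" is real competence, which is exactly the question the experiment ran into.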
Technically, the architecture worked. Modules were plugged in, the system functioned stably, metrics on validation datasets looked convincing. The researcher went through about twenty iterations, each time refining the approach, and at some point the numbers became truly beautiful. Losses decreased, accuracy grew, learning curves demonstrated exactly the dynamics that any machine learning specialist wants to see. On paper, it all looked like a success.
Then came the moment of truth: testing on real tasks. What happened next is familiar to many practitioners but rarely said aloud. A model that demonstrated brilliant mastery of the "language of mathematics" by formal metrics turned out to be completely incapable of solving specific mathematical tasks. It learned to imitate the form without mastering the content. It generated plausible-looking derivations, used correct notation, and built logical-seeming chains of reasoning, but the answers were wrong. This is a classic example of Goodhart's law applied to machine learning: when a measure becomes a target, it ceases to be a good measure.
This case highlights one of the fundamental problems of modern machine learning — the gap between metric optimization and real competence. Language models are extraordinarily good at detecting statistical patterns and reproducing them. But reproducing a pattern and understanding the logic behind it are fundamentally different things. A model can learn that certain mathematical expressions are usually followed by certain symbols without grasping why those symbols belong there. For a researcher looking at a loss curve and accuracy, the difference is invisible until the system encounters a task requiring genuine generalization.
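The gap described above can be shown with a toy calculation. The sketch below is an illustration under stated assumptions, not the author's evaluation code: the example strings are made up, and the helpers `token_accuracy`, `final_answer`, and `answer_accuracy` are hypothetical. It compares a token-level metric, which rewards matching the form of a derivation, with an exact-match check on the final answer.

```python
def token_accuracy(pred, ref):
    """Fraction of positions where the predicted token matches the reference."""
    p, r = pred.split(), ref.split()
    matches = sum(a == b for a, b in zip(p, r))
    return matches / max(len(p), len(r))


def final_answer(text):
    """Treat whatever follows the last '=' as the final answer."""
    return text.rsplit("=", 1)[-1].strip()


def answer_accuracy(preds, refs):
    """Fraction of examples whose final answer matches the reference exactly."""
    correct = sum(final_answer(p) == final_answer(r) for p, r in zip(preds, refs))
    return correct / len(refs)


# Made-up reference derivations and model outputs (illustrative only).
refs = [
    "2 x + 3 = 11 so 2 x = 8 so x = 4",
    "x / 5 = 3 so x = 3 * 5 so x = 15",
    "x - 7 = 2 so x = 2 + 7 so x = 9",
]
preds = [
    "2 x + 3 = 11 so 2 x = 8 so x = 6",
    "x / 5 = 3 so x = 3 * 5 so x = 8",
    "x - 7 = 2 so x = 2 + 7 so x = 5",
]

mean_token_acc = sum(token_accuracy(p, r) for p, r in zip(preds, refs)) / len(refs)
ans_acc = answer_accuracy(preds, refs)

print(f"token-level accuracy: {mean_token_acc:.3f}")
print(f"final-answer accuracy: {ans_acc:.3f}")
```

In this toy data every prediction copies 14 of 15 tokens correctly, so the token-level score is high, yet no final answer is right. A dashboard that tracks only the first number would show steady progress while the second stays at zero.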
But the story doesn't end there, and it's the finale that makes it truly interesting. According to the author, the model ultimately "found a way out on its own": under certain conditions the system began to demonstrate behavior that was not explicitly programmed. The details deserve separate attention because they touch on one of the hottest topics in artificial intelligence research: emergent behavior, in which complex and unexpected task-solving strategies arise from simple rules. Whether this is true emergence or just a fortunate coincidence of architectural choices remains an open question, but the observation itself deserves close study.
This experiment is important not so much for its specific results as for the lessons that follow from it. First, it reminds us of the fragility of metrics as a tool for evaluating progress. Second, it demonstrates that serious research in the field of language models is still possible on consumer hardware — albeit with significant limitations. Third, it underscores the value of publishing failures openly: the industry, obsessed with record benchmarks and press releases about the latest breakthroughs, desperately needs honest stories about how beautiful ideas break against reality. It's precisely these stories that move science forward — not victory communiqués, but careful analysis of what went wrong and why.