Language model distillation: can knowledge theft be proven through chat?
Amid Anthropic’s accusations against Chinese developers over the distillation of Claude, an intriguing study has appeared. The author tested whether it is possi
AI-processed from Habr AI; edited by Hamidun News
In the world of large language models, a conflict is brewing that resembles the patent wars in pharmaceuticals, except that what is being stolen is not molecules but knowledge packaged in billions of parameters. New research published on Habr poses a provocative question: is it possible, simply by conversing with a language model in a chat, to determine that it was trained by distillation on the outputs of another model? The author believes so. If he is right, it changes the rules of the game for the entire industry.
To understand the context, one must return to the scandal that erupted several months earlier. Anthropic, the creator of Claude, publicly accused Chinese developers of systematically distilling its model. The core of the claim was that engineers in China were using the Claude API at scale and collecting its responses to train their own models. Anthropic said it discovered this through account monitoring: analyzing request patterns, usage history, and links between accounts and Chinese companies. The evidence was built at the infrastructure level: who sent requests, when, and how many.
But the author of the research took a completely different path. He wondered: what if the evidence is hidden not in server logs, but in the model itself? Distillation is a process in which a small student model is trained to reproduce the behavior of a large teacher model. Essentially, it is knowledge compression: instead of training a model on terabytes of raw data, the developer feeds it ready-made answers from a more powerful system. The student model adopts not only facts but also stylistic features, chains of reasoning, characteristic turns of phrase, and even the teacher's errors. It is these traces, a kind of fingerprint, that the researcher attempted to detect through so-called model self-reporting.
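The way a student inherits a teacher's phrasing can be illustrated with a deliberately tiny toy. This is only a sketch under strong assumptions: `teacher` is a hypothetical stand-in that returns canned text (not a real API), and the "student" is a bigram table rather than a neural network. The study's actual models and training setup are not described in the source.

```python
from collections import Counter, defaultdict


def teacher(prompt):
    """Hypothetical stand-in for a large model; returns canned text."""
    return f"I'd be happy to help with {prompt}. Here is a brief overview."


def train_student(prompts):
    """Fit a bigram table using only the teacher's outputs (no raw data)."""
    table = defaultdict(Counter)
    for prompt in prompts:
        tokens = ["<s>"] + teacher(prompt).split() + ["</s>"]
        for current, following in zip(tokens, tokens[1:]):
            table[current][following] += 1
    return table


def generate(table, max_len=20):
    """Greedy decoding: always follow the most frequent next token."""
    words, token = [], "<s>"
    for _ in range(max_len):
        if not table.get(token):  # unseen token: nothing to continue with
            break
        token = table[token].most_common(1)[0][0]
        if token == "</s>":
            break
        words.append(token)
    return " ".join(words)


student = train_student(["sorting", "recursion", "hashing"])
sample = generate(student)
print(sample)
```

The student never saw any data except the teacher's answers, so its output opens and closes with the teacher's stock phrases. Real distillation works on a vastly larger scale, but the mechanism by which style leaks from teacher to student is the same in spirit.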
The methodology looks elegant in its simplicity. If a model was trained on Claude's responses, it may involuntarily reproduce patterns characteristic of Claude: specific refusal formulations, recognizable reasoning structure, certain ethical frameworks that Anthropic builds into its product. This is similar to how a linguist can determine where a person grew up based on speech peculiarities—except here we're talking about a neural network's "training region." The research author claims to have discovered such markers, though he makes an important caveat: the results are preliminary in nature and cannot serve as legal evidence.
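A marker check of this kind could be sketched roughly as follows. Everything here is an assumption for illustration: the phrases in `HYPOTHETICAL_MARKERS` are invented and are not the markers reported in the study, and the sample responses are made up rather than collected from any real model.

```python
def marker_rate(responses, markers):
    """Fraction of responses containing at least one marker phrase.

    Matching is a case-insensitive substring test.
    """
    if not responses:
        return 0.0
    lowered = [m.lower() for m in markers]
    hits = sum(any(m in r.lower() for m in lowered) for r in responses)
    return hits / len(responses)


# Invented phrases for illustration only; not taken from the study.
HYPOTHETICAL_MARKERS = [
    "i'd be happy to",
    "i don't feel comfortable",
    "i aim to be",
]

# Made-up chat transcripts standing in for collected model outputs.
suspect_responses = [
    "I'd be happy to walk through that proof.",
    "I don't feel comfortable speculating about that.",
    "Here is the code you asked for.",
]
control_responses = [
    "Sure, here's the proof.",
    "No idea, sorry.",
    "Here is the code.",
]

suspect_rate = marker_rate(suspect_responses, HYPOTHETICAL_MARKERS)
control_rate = marker_rate(control_responses, HYPOTHETICAL_MARKERS)
print(f"suspect: {suspect_rate:.2f}, control: {control_rate:.2f}")
```

A higher rate in the suspect model than in a control is at most a hint, which is consistent with the author's own caveat that such results are preliminary.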
This caveat is not merely routine politeness, but a reflection of a fundamental problem. Language models remain largely black boxes even to their creators. No one can say with absolute certainty why a model produced exactly that answer. The coincidence of stylistic patterns could be the result of distillation, or it could be a consequence of training on similar data from open sources. Two models trained on the same scientific papers and books will inevitably resemble each other, and this has nothing to do with intellectual property theft.
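The shared-data confound can be made concrete with a small sketch. The three strings below are invented: `reference` stands for the alleged teacher, `suspect` for the model under suspicion, and `control` for an independent model that merely learned from the same public text. The similarity measure (Jaccard overlap of word trigrams) is a simple choice made here for illustration, not the method used in the study.

```python
def ngrams(text, n=3):
    """Set of word n-grams in a lowercased, whitespace-split text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of two texts' n-gram sets (0.0 if both are empty)."""
    x, y = ngrams(a, n), ngrams(b, n)
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)


# Invented outputs; all three echo the same textbook sentence.
reference = "the mitochondria is the powerhouse of the cell and produces atp"
suspect = "the mitochondria is the powerhouse of the cell and makes energy"
control = "the mitochondria is the powerhouse of the cell which stores dna"

sim_suspect = jaccard(reference, suspect)
sim_control = jaccard(reference, control)

# Only the overlap beyond what an independent model shows is informative.
excess = sim_suspect - sim_control
print(f"suspect={sim_suspect:.2f} control={sim_control:.2f} excess={excess:.2f}")
```

Both similarities come out high because all three texts draw on the same source sentence. A raw similarity score therefore says little on its own; any claim of distillation has to rest on the margin over independent controls, and even that margin can have innocent explanations.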
Nevertheless, the direction of research itself is extremely promising. The industry urgently needs tools for verifying a model's origin. Today the market is flooded with open-source models, many of which handle tasks suspiciously well given that those tasks should, in theory, require far greater computational resources. If methods of "linguistic expertise" for neural networks are perfected, they could become the foundation for a new field: AI forensics, the equivalent of forensic science for artificial intelligence.
For major labs like OpenAI, Anthropic, and Google DeepMind, the stakes are colossal. Training frontier models costs hundreds of millions of dollars, and if competitors can achieve comparable quality for a fraction of these costs through distillation, the entire economic model collapses. It is no coincidence that the user agreements of most major providers already contain explicit prohibitions on using output data to train competing models. But a prohibition without a mechanism to detect violations is just words on paper.
The research, despite its preliminary nature, points to a future where models will carry indelible traces of their origin. Perhaps in time, developers will begin deliberately embedding hidden watermarks in their models—unique patterns of responses that cannot be removed through distillation. Some companies are already experimenting with such techniques. If these methods become reliable, the world of AI development will gain something it critically lacks now: a mechanism of accountability. For now, the industry balances on a thin line between open knowledge exchange and protection of investments—and this line grows thinner with each passing month.