Project Panama: Anthropic Puts Millions of Books Under the Knife to Train Claude

While OpenAI fights publishers in court, Anthropic has taken a more radical path. Details have leaked online about "Project Panama," a secret program to digitize milli…

AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

Imagine a huge warehouse filled with pallets of books. But this is not a library or a quiet archive. Here they don't read; here they dissect. People in protective suits carefully cut the spines off brand-new volumes, turning the books into stacks of loose sheets that then disappear into high-speed scanners. This is not a scene from a Ray Bradbury dystopia but the everyday work of a secret division of Anthropic. The project was code-named Project Panama, and its details have just surfaced in court documents, unsettling the industry with the sheer scale of the operation.

Anthropic has long cultivated an image as the "ethical" alternative to OpenAI. While Sam Altman aggressively vacuumed up the internet, the creators of Claude spoke about safety and responsibility. It turns out, however, that when quality data runs short, ethics give way to an industrial meat grinder. Court documents revealed that in early 2024 the startup's leadership launched an ambitious plan for "destructive scanning of all the world's books." The wording sounds ominous, but from a technical standpoint it is justified: to scan quickly and without distortion, a book has to be literally destroyed and turned into a set of flat pages.

Why go to such lengths when digital versions exist? The answer lies in quality and rights. Legal digital libraries are expensive and bound by strict licenses, while pirate archives, the so-called shadow libraries, often contain OCR errors. To train models at the level of Claude 3.5 or the future Claude 4, you need clean, structured knowledge. Anthropic decided that it was simpler and cheaper to buy millions of physical copies, turn them into dust, and get perfect digital copies than to negotiate with each rights holder individually. The operation's budget ran to tens of millions of dollars, a sum comparable to the cost of purchasing H100 chips.

This situation highlights the main problem of the modern AI industry: the "data wall" is not a myth but a reality. Humanity has already fed neural networks almost all of Reddit, Wikipedia, and digitized newspaper archives, yet the models' appetites are growing exponentially. Where we once talked about AI replacing writers, we now see AI literally devouring their physical legacy. The irony is that a startup valued in the billions of dollars is forced to handle waste-paper logistics to gain a few percentage points of chatbot accuracy.

The secrecy of Project Panama has a simple explanation: it looks terrible from a PR perspective. It's hard to sell the public on "safe AI" built on the ruins of destroyed books. Anthropic's lawyers probably hoped that physical ownership of a book would give them a loophole under the "fair use" doctrine. The reasoning goes: we bought the book, so we have the right to read it, even if the "reader" is an algorithm and the reading process requires destroying the medium. However, courts are unlikely to look so favorably on mass industrial copying.

What does this mean for us? We have entered an era where information in the physical world is becoming more valuable than digital dust. Where we once digitized books to preserve them for posterity, we now do it to feed them to a "black box" that will hand us a summary in a chat. This is a radical shift in the culture of knowledge consumption. Soon we may face a shortage of rare editions simply because yet another AI unicorn decided to buy up an entire print run to train its new "language machine."

Bottom line: Anthropic has shown that in the battle for data, no prisoners are taken. Are we ready for the fact that the intelligence of the future will be built on the ashes of burned books?

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.