
The Atlantic opened a search tool for 21 million tracks used to train AI


AI-processed from The Verge; edited by Hamidun News
Source: The Verge. Collage: Hamidun News.

Alex Reisner, a journalist at The Atlantic, has published the results of an investigation in which he identified four music datasets that technology companies used to train generative AI models. He also made all four datasets publicly searchable, so anyone can now check whether their tracks ended up in the training data.

What Reisner Found

The two largest datasets are striking in scale: one contains 12 million tracks and the other 9 million, for a combined 21 million music files. The two other datasets are more modest but still significant, each with more than 100,000 recordings. Collectively, this is a colossal volume of content, covering much of the music that can be gathered automatically.

All four datasets have been downloaded thousands of times. It is impossible to establish exactly who downloaded them, but Google and Stability AI have confirmed in their research papers that they worked with this data. That is documentary evidence that companies with multibillion-dollar valuations relied on the same sources that are now open to public search.

Where This Music Comes From

The sources behind the datasets vary in legal status, and this is where the most important part begins:

  • Free Music Archive: free for personal listening, but commercial use and the creation of derivative works are restricted
  • Some tracks are published under Creative Commons licenses, with conditions that vary from track to track
  • Some material is protected by standard copyright, with no exceptions or caveats

All four datasets were technically available for download without any restrictions, and no AI company has publicly disclosed the exact composition of its music training datasets.

The gap between "technically available for download" and "legally allowed to use for commercial AI training" is precisely the legal space in which lawsuits are now unfolding around the world.

Tool for Rights Holders

The Atlantic launched a public search tool covering all four datasets. Any musician, producer, label, or publisher can search for their name or track titles and find out whether their work was part of a training set.

This matters in practice. Lawsuits against AI companies, including Suno, Udio, OpenAI, Stability AI, and others, are already being heard in court, but plaintiffs have so far lacked a reliable way to prove that specific works were used. The Atlantic's public database could serve as evidence in these cases.

Reisner's investigation continues a series of exposés from recent years. First came revelations about the mass use of books without permission (the Books3 dataset), then about text scraped from the open web (Common Crawl). Now it is music's turn. The logic is the same: AI companies collected everything that was technically available without asking about its legal status.

What This Means

The Atlantic's publication moves the AI copyright dispute from the abstract to the concrete: here is the data, here are the companies, here are the tracks. For musicians, this is the first public verification tool. For AI companies, it is a signal that opacity about training data is becoming increasingly difficult to maintain.
