Google and OpenAI Hit the Limit: What Happens When the Internet Runs Out of Human Text
Generative AI faces a peculiar dependency: models require human-written text, yet simultaneously reduce the incentive to create it. AI summaries in search…
AI-processed from Habr AI; edited by Hamidun News
The main vulnerability of generative AI is that it thrives on human text while simultaneously destroying the conditions under which this text is created. As long as search engines, chatbots, and AI summaries promise users quick answers without visiting the original site, they reduce the revenue of those who produce the original material. In the short term, models win through convenience, but in the long term they risk being left without a quality training base and starting to learn from their own reflections.
The first problem is publishing economics. After Google launched AI Overviews in May 2024, search increasingly began answering queries directly on the results page, without sending readers to the source. According to Chartbeat data published by Axios on March 17, 2026, small websites with 1,000 to 10,000 views per day lost about 60% of their search referrals over two years.
Medium-sized sites lost 47%, and large ones 22%. A Pew Research Center study from July 22, 2025 showed a similar effect at the level of user behavior: when search results include an AI summary, people click on regular links significantly less often. For media, forums, niche blogs, and independent authors, this is not an abstract metric but a direct blow to advertising, subscriptions, and the motivation to keep writing.
The second problem is the data itself. Large language models were originally trained on enormous collections of internet text. The main training corpus of early systems such as GPT-3 was assembled from the web and related sources.
But the volume of quality human-generated content is not infinite. In June 2024, Epoch AI researchers estimated that if scaling continued at its earlier pace, the industry could hit the limit of publicly available text suitable for training between 2026 and 2032. This is why major players began signing deals with Reddit, publishers, and other owners of large archives: access to data went from being a technical detail to being a strategic asset.
Against this backdrop, the temptation to switch to synthetic data seems almost inevitable. If there isn't enough real text, it makes sense to ask one model to generate material for the next. The problem is that such a scheme gradually degrades quality.
A paper published in Nature on July 25, 2024, describes the model collapse effect: with recursive training on machine-generated data, models begin to lose rare facts, smooth out complex patterns, and amplify existing errors and biases. It's like photocopying a copy of the same page over and over: the overall meaning is still visible, but details fade with each iteration. Even OpenAI has publicly acknowledged that synthetic data can help in specific cases but does not look like a complete replacement for diverse human-generated text.
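The loss of rare details under recursive training can be illustrated with a toy simulation. This is a minimal sketch, not the method from the Nature paper: it assumes a hypothetical vocabulary with a Zipf-like frequency distribution, and each "generation" simply re-estimates token frequencies from a finite sample drawn from the previous generation. The function name and all parameter values are invented for illustration.

```python
import random
from collections import Counter


def simulate_collapse(vocab_size=200, sample_size=500, generations=10, seed=0):
    """Toy model: each generation learns token frequencies only from
    a finite sample produced by the previous generation."""
    rng = random.Random(seed)
    tokens = list(range(vocab_size))
    # Hypothetical "human" distribution: a few common tokens, many rare ones.
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    support_sizes = [vocab_size]
    for _ in range(generations):
        # The current "model" generates a synthetic corpus.
        sample = rng.choices(tokens, weights=weights, k=sample_size)
        counts = Counter(sample)
        # The next "model" is fit on that corpus alone. A token that was
        # never sampled gets weight 0 and can never reappear.
        weights = [counts.get(token, 0) for token in tokens]
        support_sizes.append(sum(1 for w in weights if w > 0))
    return support_sizes


if __name__ == "__main__":
    # Number of distinct tokens still known to the model, per generation.
    print(simulate_collapse())
```

In this sketch the count of surviving tokens can only shrink from one generation to the next, because nothing reintroduces a token once it has dropped out. Real language models are far more complex, but the mechanism is the same in spirit: the rare tail disappears first.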
There's yet another trap: separating human text from machine-generated text is much harder in practice than it appears. AI content detectors still make mistakes, especially on short, edited, or stylistically neutral texts. Some studies have found high false positive rates on texts written by non-native English speakers.
This means the industry will have difficulty simply 'cleaning the internet' and selecting only reliable human data. Moreover, recent academic assessments already note that the share of AI assistance in new publications is growing rapidly, and that online content itself is becoming more monotonous in meaning and more sterile in tone. In other words, the problem is not just the quantity of text, but its diversity.
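The difficulty of 'cleaning the internet' with an imperfect detector can be made concrete with simple arithmetic. The sketch below is illustrative only: the function name, the corpus size, and the detector error rates are hypothetical assumptions, not figures from any of the studies mentioned above.

```python
def filter_corpus(total_docs, human_share, false_positive_rate, true_positive_rate):
    """Estimate what remains after removing every document that a
    detector flags as machine-generated.

    false_positive_rate: share of human texts wrongly flagged as AI.
    true_positive_rate:  share of machine texts correctly flagged as AI.
    """
    human_docs = total_docs * human_share
    machine_docs = total_docs - human_docs
    # Human texts survive only if they are not wrongly flagged.
    human_kept = human_docs * (1 - false_positive_rate)
    # Machine texts survive whenever the detector misses them.
    machine_kept = machine_docs * (1 - true_positive_rate)
    kept = human_kept + machine_kept
    return {
        "human_discarded": human_docs - human_kept,
        "machine_kept": machine_kept,
        "machine_share_after": machine_kept / kept if kept else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical inputs: half the corpus is human-written, the detector
    # wrongly flags 10% of human texts and catches 80% of machine texts.
    print(filter_corpus(1000, 0.5, 0.1, 0.8))
```

Under these assumed numbers, filtering both throws away some genuine human writing and leaves a noticeable share of machine text in the 'clean' set, which is why detection alone is unlikely to solve the problem.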
If this cycle is not broken, the internet will start working worse for all participants. Authors will publish in-depth material less frequently because it becomes harder to monetize. Platforms will continue to fill search results with brief AI summaries, saving users a click, but impoverishing the ecosystem of sources.
And model developers will receive more and more derivative content that sounds confident but carries less new knowledge. The solution seems to lie not in an even greater volume of generation, but in preserving incentives for human writing: through licensing payments, transparent attribution, more careful use of AI summaries, and prioritizing data quality over raw scale. Otherwise, AI will indeed end up in the trap it has built for itself.