Daniel Kosbab · 3 min read

Training on AI-generated data is the next mess

The open web is filling with text generated by LLMs.

As of 2026, a non-trivial fraction of the English text published online in the last three years is AI-written or AI-edited. The next generation of frontier models will be trained on a corpus that includes that text. The generation after that will be trained on text generated by the previous generation.

This is a compounding problem and it has a name.

Model collapse

Model collapse is the observation that models trained on their own outputs degrade. First the tails of the distribution disappear. Rare words. Rare facts. Then the mode sharpens. The model produces more of what it is good at, less of everything else. A few generations of this and the model is fluent in a narrow subset of the original distribution, with the rest lost.

This was first studied under controlled conditions with small models. The failure mode is real and reproducible. Whether it happens at the same magnitude in frontier pre-training is debated. The mechanism is not exotic. It follows from statistics.

If you train a model to match a distribution, and the distribution you observe is the output of a previous model, you inherit that model's bias. Stack that across generations and the bias compounds.
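The compounding is easy to reproduce with a toy model. A minimal sketch, not any lab's setup: fit a Gaussian to samples drawn from the previous generation's fit, over and over. The distribution, sample size, and generation count here are illustrative assumptions.

```python
import numpy as np

def simulate_collapse(n_samples=100, n_generations=2000, seed=0):
    """Repeatedly refit a Gaussian to samples from the previous fit.

    Generation 0 is the "human" distribution. Each later generation
    draws n_samples from the current model, refits mean and variance
    by maximum likelihood, and uses the refit as the next model.
    Returns the fitted variance at every generation.
    """
    rng = np.random.default_rng(seed)
    mu, var = 0.0, 1.0          # generation-0 "human" distribution
    variances = [var]
    for _ in range(n_generations):
        samples = rng.normal(mu, np.sqrt(var), size=n_samples)
        mu = samples.mean()
        var = samples.var()      # MLE estimate (ddof=0, biased low)
        variances.append(var)
    return variances

variances = simulate_collapse()
```

The fitted variance drifts toward zero: the maximum-likelihood estimate runs slightly low on average, and with no fresh data there is nothing to push it back up. The tails go first, exactly as above.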

Why scale alone doesn't save you

Bigger models and more data are the usual response to problems in deep learning. This one is different.

The data side of "more data" is what is getting poisoned. Crawling the web in 2028 yields proportionally more AI-generated text than crawling it in 2023 did. The marginal data point carries less signal than it used to.

The model side of "bigger models" doesn't help because the signal in the data is what is eroding. A bigger brain trained on a thinner signal is still trained on a thinner signal.

The only way out is to distinguish AI-generated from human-generated text and weight them differently.

What labs are doing

Three approaches are in use, none of them a full solution.

  1. Watermarking. Frontier models try to embed detectable signals in their output. In theory, downstream trainers can filter watermarked text. In practice, watermarks are brittle against paraphrase, translation, and adversarial editing. They survive sometimes. Not reliably.
  2. Provenance signals. Trust certain sources (journalism, academic papers, verified publishers) more than scraped web text. Works to a point. But legitimate sources use LLMs for drafts now. Provenance gets less informative over time.
  3. Human preference data. Train on human ratings of outputs, not on text scraped from the web. This is what RLHF is. Expensive and does not scale to pre-training volumes.

Each is a partial fix. Together they slow the problem. None reverses it.
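The three signals compose naturally into a per-document training weight rather than a hard filter. A hypothetical sketch: the field names, thresholds, and weight values below are illustrative assumptions, not any lab's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    watermark_detected: bool   # signal 1: watermark check (brittle)
    source_trust: float        # signal 2: provenance score, 0.0 to 1.0
    human_rated: bool          # signal 3: has human preference labels

def training_weight(doc: Document) -> float:
    """Downweight likely-synthetic text instead of discarding everything."""
    if doc.watermark_detected:
        return 0.0                    # confident synthetic: drop it
    weight = doc.source_trust         # trust erodes as sources adopt LLMs
    if doc.human_rated:
        weight = max(weight, 1.0)     # human-labeled data keeps full weight
    return weight

docs = [
    Document("scraped blog post", False, 0.3, False),
    Document("verified journalism", False, 0.8, False),
    Document("watermarked model output", True, 0.9, False),
    Document("RLHF comparison pair", False, 0.2, True),
]
weights = [training_weight(d) for d in docs]  # [0.3, 0.8, 0.0, 1.0]
```

Even in this toy form the limits show: the watermark check misses paraphrased text, and the trust score decays for the reasons above.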

The quieter symptom

Even before model collapse becomes dramatic, there is a subtler effect.

If 30% of your training corpus is AI-generated, the model learns the statistical shape of AI-generated text. Its own output becomes more AI-shaped. Which becomes more of the corpus for the next generation. The distribution drifts toward a narrow mode that reads as "smooth, polished, slightly generic." The kind of text you can recognize without knowing why.

This is happening now. Readers notice. The feeling that a piece of prose is AI-written is a real signal, not a bias. The signal is strengthening because the AI's own text is becoming more of what AI text is trained on.

The position

Treat human-generated content as the scarce resource it is becoming.

For training: invest in provenance and human labor. Synthetic data helps on specific narrow tasks. It does not replace the open corpus the field grew up on, because that corpus is being diluted.

For writing: the value of clearly human-written prose is going up. "Sounds like a person" will differentiate. The generic voice is about to become very common.

For readers: the filter skill you are developing (noticing AI-shaped text within a sentence) is going to matter. Expect more of it, and expect it to get harder to spot as the training loop tightens.

The signal and the noise are converging. What stays signal is what was never on the web in the first place.

© 2026 Daniel Kosbab
