How AI’s Future Depends on Data Provenance

The Low-Background Steel Era of AI

In my last piece I argued that the internet was never a clean dataset, and that AI is now polluting its own water supply by training on its own exhaust. Several people asked the reasonable follow-up question: so what happens next? Does the whole thing degrade into fluent mush?

No, and the reason why is the actual story. Predicting the future is a great way to look stupid in public, so I will at least state the core argument up front where it is easy to quote back at me later: data is shifting from a free resource you scrape into an asset you verify, manufacture, or buy. That one shift rearranges who wins. The next phase of AI belongs not to whoever has the most data, but to whoever can prove where their data came from and whether it is true.

Everything below is that single idea applied four ways: to old data, to synthetic data, to human expertise, and to the web itself.

Pre-2022 Data Is the New Low-Background Steel

After 1945, atmospheric nuclear testing scattered trace radionuclides through the air, and because steelmaking pulls in enormous volumes of air, new steel came out faintly radioactive. For decades, anyone building radiation-sensitive instruments needed steel made before the bombs, which often meant salvaging it from pre-war shipwrecks. Old battleships became quarries. Humanity irradiated its own building material and then went diving for the clean stuff. Keep that image handy, because we just did it again, faster, and on purpose this time.

The same dynamic is now playing out with text. Data created before roughly 2022 is provably human in a way nothing after it can be. It cannot be contaminated retroactively. Watch where the money is going: Reddit licensed its archive to Google in a deal reported at around sixty million dollars a year, which means two decades of arguments about whether a hot dog is a sandwich are now a strategic AI asset. OpenAI has signed multi-year content deals with News Corp, Axel Springer, the Associated Press, and Stack Overflow, some reportedly running into nine figures. These are companies paying serious money for the one thing they cannot generate themselves: text with a verifiable human origin and a timestamp that predates the contamination. The industry has already concluded that verified human data is a finite, non-renewable resource. Nobody is putting it that bluntly in a press release, because press releases are where bluntness goes to die.

If you have twenty years of forum posts, technical documentation, or internal knowledge bases with clean provenance, you are sitting on an asset. You probably did not know it was an asset in 2021. The shipwreck did not know it was valuable either.

The Twist: Synthetic Data Is Not the Villain

Here is the part that gets lost in most model collapse coverage, usually somewhere between the scary headline and the stock photo of a robot. The frontier labs are training on enormous amounts of synthetic data right now, deliberately, and it is working.

It helps to be precise about terms, because “AI-generated data” covers two very different things. Contaminated data is model output ingested indiscriminately from the open web, with no way to separate signal from plausible-sounding noise. Synthetic data is model output generated deliberately, with a verifier attached. The first causes collapse. The second is a manufacturing process.

The difference is control. Deliberate synthetic data comes with a quality gate: generated code that has to pass a test suite, math that has to check out, reasoning chains graded against known answers. The generator can hallucinate all it wants; the verifier throws the garbage out before it enters the training set.
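To make the quality gate concrete, here is a minimal sketch in Python. It assumes each candidate is a generated function plus a handful of known input/output cases; the names passes_tests and quality_gate and the candidate layout are hypothetical, invented for illustration, and not any lab's actual pipeline. The gate also calls exec directly, which a real system would replace with a sandbox.

```python
def passes_tests(source, func_name, cases):
    """Return True only if the generated code runs and matches every known case."""
    namespace = {}
    try:
        # Assumption: trusted demo input. A real pipeline would sandbox this call.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime crashes all count as rejects.
        return False


def quality_gate(candidates):
    """Keep only the candidates that the verifier accepts."""
    return [
        c for c in candidates
        if passes_tests(c["source"], c["func_name"], c["cases"])
    ]


cases = [((1, 2), 3), ((0, 0), 0)]
candidates = [
    # Correct output.
    {"source": "def add(a, b):\n    return a + b\n", "func_name": "add", "cases": cases},
    # Plausible-looking but wrong.
    {"source": "def add(a, b):\n    return a - b\n", "func_name": "add", "cases": cases},
    # Does not even parse.
    {"source": "def add(a, b) return a + b", "func_name": "add", "cases": cases},
]

kept = quality_gate(candidates)
print(f"kept {len(kept)} of {len(candidates)} candidates")
```

The generator's failure rate matters less than it first appears, since rejected samples never reach the training set. What matters is how much the test cases actually cover, which is why the verifier is the hard part.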

This reframes the whole problem. The threat was never synthetic data. The threat is contaminated data, and more broadly anything unverifiable. Which means the next competitive moat is not scale, it is filtration. The labs that build the best verifiers, graders, and quality gates get to manufacture training data on demand while everyone else fights over a shrinking pool of verified human data. Data stops being something you harvest and becomes something you manufacture under QA. Anyone who has worked in an actual factory knows QA is where the hard problems live.

The Long Tail Gets Bought, Not Scraped

My last piece pointed out that model collapse washes out the long tail first: rare expertise, minority perspectives, the unusual-but-true. That knowledge does not stop existing. It stops being free.

We are already seeing the early shape of this. Data labeling firms like Scale AI and Surge are recruiting physicians, attorneys, and senior software engineers at reported rates well north of a hundred dollars an hour, not to label cat photos but to produce expert reasoning and grade model outputs in their specialty. Expert knowledge that used to leak onto the open web for nothing is becoming a labor market with rate cards. The internet spent twenty years convincing experts to give their knowledge away in exchange for upvotes and the occasional “thanks, this fixed it” on a six-year-old thread. AI is about to spend the next ten paying them to do it on purpose, under contract, with provenance attached. Somewhere, every engineer who ever answered a Stack Overflow question for free is doing the math on back royalties.

There is an irony here worth sitting with. The scraping era treated human expertise as a free commons. The collapse problem is forcing the industry to put a price on it. The long tail is about to get a W-2.

The Web Bifurcates

Fast-forward five years, and I think the open web splits into two layers. One is a gray layer of unattributed, statistically average, machine-flavored content that nobody trains on and increasingly nobody reads. You have already scrolled past a few thousand specimens this week; some of them had very confident opinions about leadership. The other is a verified layer: provenance-signed content, gated communities, identity-backed publishing, places where being a confirmed human is the price of admission.

The standards work is underway. C2PA content credentials, backed by Adobe, Microsoft, and the BBC among others, cryptographically bind a piece of content to its origin and edit history. Leica already ships cameras that sign photos at the moment of capture. The technology is the easy part. The hard part is adoption, and adoption will come from economics rather than principle. When verified human data commands a licensing premium and unverified content is worthless as training data, publishers will sign their work for the same reason food producers print lot numbers.
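The core idea of binding content to its origin fits in a few lines of Python. The sketch below is deliberately not C2PA: the real standard uses certificate-based signatures and structured manifests, while this stand-in uses a shared-secret HMAC from the standard library to stay self-contained. The key, the function names, and the credential layout are all hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical demo key. A real scheme would use a private key tied to an identity.
SIGNING_KEY = b"demo-key-not-for-production"


def sign_content(content, origin, key=SIGNING_KEY):
    """Build a credential that ties the content's hash to a claimed origin."""
    claim = {"origin": origin, "sha256": hashlib.sha256(content).hexdigest()}
    payload = json.dumps(claim, sort_keys=True).encode("utf-8")
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify_content(content, credential, key=SIGNING_KEY):
    """Check that the claim is untampered and that it matches this exact content."""
    claim = credential["claim"]
    payload = json.dumps(claim, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, credential["signature"]):
        return False  # the claim itself was altered or signed with another key
    actual = hashlib.sha256(content).hexdigest()
    return hmac.compare_digest(actual, claim["sha256"])


photo = b"raw bytes of an image straight off the sensor"
credential = sign_content(photo, origin="camera-serial-0001")

print(verify_content(photo, credential))                # original bytes
print(verify_content(photo + b" edited", credential))   # modified bytes
```

Changing a single byte of the content, or a single field of the claim, makes verification fail. That property is what lets a buyer of training data check provenance without trusting the seller's word.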

The strange consequence for individuals: your authentic, verifiable, human-authored writing is about to become more valuable, not less. Not because it is better than what a model can produce (sometimes it will not be), but because it is provably yours and provably human. Provenance becomes a feature of the content itself.

What to Do With This

If you build systems: treat your data pipeline like a supply chain. Software engineering went through this with the SBOM, the software bill of materials, after enough supply chain attacks made “where did this dependency come from” a board-level question. Data is next. Log provenance on everything you generate and everything you ingest, even if nothing downstream consumes it yet. Retroactive provenance does not exist. Ask the steel industry.
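As a rough illustration of what logging provenance can mean in practice, here is a sketch of an append-only log in Python. It assumes three fields are enough to start with: a content hash, whether the item was ingested or generated, and where it came from. The ProvenanceRecord and ProvenanceLog names and the field layout are hypothetical, not an existing standard. Each record also stores the hash of the record before it, so that quietly rewriting history is detectable.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    content_sha256: str
    kind: str                 # "ingested" or "generated"
    source: str               # e.g. a URL, a vendor name, or a model identifier
    recorded_at: str          # UTC timestamp in ISO 8601 form
    prev_record_sha256: str   # hash of the previous record, "" for the first


def record_hash(record):
    """Stable hash of a record, used to chain each entry to the one before it."""
    payload = json.dumps(asdict(record), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ProvenanceLog:
    def __init__(self):
        self.records = []

    def append(self, content, kind, source):
        if kind not in ("ingested", "generated"):
            raise ValueError(f"unknown kind: {kind!r}")
        prev = record_hash(self.records[-1]) if self.records else ""
        record = ProvenanceRecord(
            content_sha256=hashlib.sha256(content).hexdigest(),
            kind=kind,
            source=source,
            recorded_at=datetime.now(timezone.utc).isoformat(),
            prev_record_sha256=prev,
        )
        self.records.append(record)
        return record

    def chain_intact(self):
        """True if every record still points at the hash of its predecessor."""
        return all(
            self.records[i].prev_record_sha256 == record_hash(self.records[i - 1])
            for i in range(1, len(self.records))
        )


log = ProvenanceLog()
log.append(b"forum post text", kind="ingested", source="internal-forum-export")
log.append(b"model-written summary", kind="generated", source="summarizer-v1")
print(log.chain_intact())
```

In a real pipeline the records would go to durable storage, not a Python list, but the habit is the same: write the record at the moment the data enters or leaves your hands, because that moment cannot be reconstructed later.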

If you evaluate vendors: last time I suggested asking about provenance strategy. Add a second question: what is your verification strategy for synthetic data? A vendor who says “we don’t use synthetic data” is either two years behind or lying, and you should care which, because one of those is fixable. A vendor who can describe their verifiers without checking with the booth staff is telling you something real about their engineering culture.

And if you create things: keep receipts. Drafts, timestamps, version history, anything that anchors your work to you and to a date. It feels paranoid now. So did backing up your laptop, right up until the day it did not.

The Through Line

Pull the four threads together and they turn out to be the same thread. Pre-2022 archives are valuable because their provenance is beyond question. Synthetic data works when a verifier vouches for it. Expert knowledge is becoming a paid market because a contract creates a chain of custody that scraping never did. And the web is splitting along exactly that line: content that can prove what it is, and content that cannot.

Model collapse, in other words, is not the end of the story. It is the forcing function. The pollution problem from my last piece does not get solved. It gets priced in, and the price reshapes the entire industry around a single question: can you prove this data is real? It is a strange milestone for a civilization, needing a cryptographic signature to establish that a human wrote a paragraph. But here we are, and pretending otherwise is not a strategy.

The river is not getting cleaner. Nobody serious thinks it will. The next era belongs to whoever builds the best treatment plants, and whoever thought to bottle the spring water back when it was still free.