SECURITY · July 1, 2026
The internet went dark. Build for a web you can't trust.
'Dead internet theory' used to be a conspiracy meme. Now that a majority of new web pages contain AI-generated content, it's an engineering constraint. Your agents retrieve from a web where you can no longer know who — or what — produced anything. The danger isn't that everything's fake; it's that provenance became unknowable. Which means 'it's on the internet' is dead as a trust signal, and trust has to move to the data layer: signed, allowlisted, provenance-tracked sources.
For years, "dead internet theory" lived in a conspiracy corner of the web: the paranoid idea that most of the internet is bots talking to bots. It's not paranoid anymore. By one 2025 analysis, over 74% of newly published web pages contained AI-generated content. Whatever the exact figure, the direction is clear: the majority of new pages on the open web now carry synthetic content, and that turns a meme into an engineering problem, especially if you build anything that reads the web.
The problem isn't fake. It's unknowable.
The instinct is to worry that AI content is wrong. That's not quite the threat. Plenty of it is fine. The real damage is subtler, and Andrew Stiefel put it well: the dead internet "kills trust by making everything unknowable." You can no longer tell what a page is: human expertise or model output, genuine or SEO chaff, a real review or a generated one, a primary source or a hallucination three hops downstream that now looks like a citation.
For a human, that's annoying. For an agent, it's structural. Your RAG pipeline, your research agent, your grounding layer — they all reach out to this web and pull content in as if it were signal. But "I found it on the internet" no longer means anything. The provenance you were implicitly trusting isn't degraded; it's gone.
"It's on the internet" used to be weak evidence. Now it's no evidence. The web stopped being a source of truth and became a source of plausible text — which is exactly what a model already produces.
Verifying the source now matters more than verifying the model
Everyone obsesses over whether the model hallucinates. But if you ground a model on a real source so it can't make things up, and that "source" is itself AI slop of unknown origin, you've built a laundering machine: you've taken untrusted text and given it the authority of a citation. A perfectly honest model grounded on a poisoned web produces confident, well-sourced nonsense.
So the trust problem moves down a layer. It's not "is the model right?" It's "do I trust where this came from?" And on a majority-synthetic web, the answer defaults to no.
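Here is what "defaults to no" can look like in a retrieval step. This is a minimal sketch, not a definitive implementation: the `TRUSTED_HOSTS` set, the function names, and the shape of the retrieved documents (a dict with a `url` key) are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: hosts you have actually vetted (example values only).
TRUSTED_HOSTS = {"docs.python.org", "arxiv.org"}


def is_trusted(url: str) -> bool:
    """Default-deny: trust a URL only if it is HTTPS and its exact host is allowlisted."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (e.g. a broken IPv6 literal) are simply not trusted.
        return False
    host = (parts.hostname or "").lower()
    return parts.scheme == "https" and host in TRUSTED_HOSTS


def filter_retrieved(docs: list[dict]) -> list[dict]:
    """Keep only retrieved documents whose source passes the allowlist."""
    return [d for d in docs if is_trusted(d.get("url", ""))]
```

The match is on the exact hostname, so a lookalike such as `arxiv.org.evil.example` fails the check. Anything missing, malformed, or unknown is dropped rather than passed to the model.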
Build for a dark forest
If provenance is dead by default, you have to make it explicit and earned:
- Allowlist, don't crawl-and-pray. Curate a set of sources you've actually vetted. A small trusted corpus beats the open web the way a library beats a landfill.
- Prefer signed and primary. Provenance chains, signatures, first-party data, the actual paper over the blog that summarized the tweet about it. Get as close to the origin as you can.
- Treat retrieved web text as untrusted input. It's not just a knowledge source; it's attacker- and slop-influenced content your agent ingests. Verify the source, not just the model.
- Become a source worth citing. The flip side of a polluted web is that verifiable, first-party, genuinely-human signal gets more valuable. Be the thing the agents can trust.
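The "signed" and "untrusted input" items above can be sketched in a few lines. Everything here is an assumption for illustration: the `SourcedText` record, the `sign` and `ingest` helpers, and above all the use of an HMAC with a shared key, which stands in for a real publisher signature. A production system would more likely verify a public-key signature; the HMAC just keeps the sketch inside the standard library.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcedText:
    """Retrieved text that carries its provenance with it (hypothetical record)."""
    url: str
    text: str
    sha256: str      # content hash, so the exact bytes can be traced later
    verified: bool   # True only if the signature matched


def sign(key: bytes, content: bytes) -> str:
    """Stand-in for a publisher signature: HMAC-SHA256 over the content bytes."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def ingest(url: str, content: bytes, signature: str, key: bytes) -> SourcedText:
    """Wrap fetched bytes in a provenance record; unverified unless the signature matches."""
    expected = sign(key, content)
    # Constant-time comparison; encode both sides so any input string is accepted.
    verified = hmac.compare_digest(expected.encode(), signature.encode())
    return SourcedText(
        url=url,
        text=content.decode("utf-8", errors="replace"),
        sha256=hashlib.sha256(content).hexdigest(),
        verified=verified,
    )
```

The point of the record is that the text never travels without its origin and its verification status. Downstream code can then refuse to cite anything where `verified` is false, instead of treating every retrieved string as equally credible.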
The bottom line
The web crossed a line: it's now mostly machine-made, and the thing that broke isn't accuracy, it's knowability. Grounding an agent on "the internet" now means grounding it on an ocean of unattributed synthetic text — which defeats the entire point of grounding.
Stop trusting the web by default. Move trust to the data layer — allowlist, sign, trace provenance — because on a synthetic internet, verifying the source is the only grounding that still means anything.