fedorthinks
All notes

SECURITY · July 3, 2026

Prompt injection isn't a bug you'll patch

Teams keep treating prompt injection like an ordinary vulnerability — one that a model update or a clever filter will eventually close. It won't. OWASP's 2026 report and a growing line of researchers now describe it as a permanent property of how LLMs work: the model genuinely can't tell your instructions from the data it's reading. Once you accept that, the job changes. You stop trying to prevent the injection and start making sure a successful one can't do any damage — which comes down to never letting a single agent hold the three powers that turn a poisoned input into a breach.

Prompt injection isn't a bug you'll patch

Every few months a new defense against prompt injection makes the rounds — a better system prompt, a classifier that flags "ignore previous instructions," a fine-tune that's supposed to be more obedient. And every few months someone slips past it with a slightly reworded attack. We keep treating this like a bug on its way to a patch. It isn't one.

OWASP's 2026 report puts prompt injection at the center of agentic AI risk, and security researchers are now saying the quiet part plainly: it may be a permanent flaw, not a patchable bug.

Why it doesn't get "fixed"

A classic vulnerability is a mistake in code — SQL injection lives in a bad query, and a prepared statement closes it for good, because the database can tell structure from data. An LLM has no such line. Instructions and data arrive as the same thing: text in the context window. When your agent reads a web page, an email, an error report, or the output of a tool, every token in there is a candidate instruction, and the model has no reliable way to know that this sentence is your command and that one is an attacker's. That's not a defect in a particular model. It's the mechanism. Your agent trusting the tool description is the same hole; so is a web page that can hand it orders. Same root, different door.

Stop asking "how do I stop the model from being tricked?" Assume it will be tricked, every time, and ask "what can the trick actually reach?"

Design so a win costs the attacker nothing

The useful frame is Simon Willison's lethal trifecta: an injection turns into a breach only when a single agent has all three of these at once —

  1. access to private data (your emails, your DB, your files),
  2. exposure to untrusted content (anything an attacker can influence — a page, a doc, a ticket),
  3. a way to send data out (an HTTP call, an email, even rendering an image from a crafted URL).

Hold any two and a successful injection fizzles. Grant all three in one session and a single poisoned sentence becomes a working exfiltration pipeline — no exploit code required. So the whole security job becomes: never let those three meet.

  • Least privilege on tools, per task. The agent reading untrusted content doesn't also hold the keys to the customer database. Scope credentials to the job in front of it, not the whole org.
  • Break the exfiltration leg. An agent that ingests outside text shouldn't have an open outbound channel. No arbitrary HTTP, allow-listed domains only, no silent image fetches from attacker-supplied URLs.
  • Split the pipeline. Let one component summarize the untrusted document with no data access and no network; hand only the vetted result to the privileged step. Two safe agents beat one lethal one.
  • Human gate on the irreversible. Sending money, deleting records, emailing customers — a person approves. The boring, bounded design is the one that survives.

This isn't hypothetical plumbing. A poisoned dependency — a backdoored LLM gateway sitting on PyPI for three hours, tens of thousands of installs — gets you untrusted content and privilege in one move. The trifecta is how a small compromise becomes a big one.

The bottom line

Waiting for the vendor to "solve" prompt injection is a security strategy built on a fix that isn't coming. The teams that stay safe aren't the ones with the cleverest filter — they're the ones who assumed the filter fails and made sure it doesn't matter.

Treat every input your agent reads as hostile, and architect so no single agent ever holds private data, untrusted content, and an outbound channel at the same time. You can't patch the model. You can make sure that when it gets tricked, there's nothing on the other side of the door.

Comments

No comments yet

Sign in to join the conversation.

Be the first to share a thought.