Notes
Short pieces about the methodology and architecture decisions behind the AI systems I ship — specs, evals, multi-agent orchestration, LLM integration, and the discipline of directing coding agents.
June 14, 2026
You're running twelve agents. Half work alone.
The average company now runs about 12 AI agents, on the way to 20 by next year — and half of them operate entirely on their own, not talking to any of the others. We rushed to add agents faster than we wired them together, so most enterprises have a drawer full of clever tools that each see a sliver of the work and none of the whole. The value was never in having more agents. It's in the connections between them, and that's the part almost nobody built. Here's why the gap opened and how to close it.
- agents
- business
June 14, 2026
You have an agent. You don't have AI.
80% of enterprise apps shipped or updated in early 2026 embed at least one AI agent — up from 33% in 2024. That sounds like everyone has 'done AI.' But embedding an agent and getting value from it are different things: the median agent takes 5.1 months to pay back, and most deployments are still stuck in pilot, never scaled. Having an agent is now table stakes, like having a website. The gap that actually separates companies is whether the agent reached production, earned its keep, and got trusted to run. Here's the difference that matters.
- business
- agents
June 13, 2026
A green checkmark can hide a broken middle
Here's the failure mode that eats AI agents in production: an agent runs a multi-step task, makes a wrong turn somewhere in the middle, and still produces a final answer that passes your check. The output looks clean. The reasoning was broken. Researchers found this is exactly how multi-step agents fail — a step-three mistake propagates invisibly into a step-ten summary that reads fine and is wrong. If you only grade the final answer, you're blind to most of how agents actually break. Here's why, and what to check instead.
- methodology
- agents
June 13, 2026
The biggest context window doesn't win
Every model launch brags about a bigger context window — a million tokens, two million, the whole codebase at once. But an analysis of enterprise deployments found that nearly 65% of agent failures came from context drift or memory loss during multi-step work, not from a window that was too small. The teams shipping reliable agents in 2026 aren't the ones with the biggest window. They're the ones who curate hardest what the model actually sees. Here's the difference, and why more is often worse.
- agents
- methodology
June 13, 2026
Your agent works 57% of the time
A March 2026 report looked at 6,259 AI agents running in real production and found an aggregate success rate of 56.6%, barely better than a coin flip. The same research shows a 37% gap between how agents score on benchmarks and how they do in the real world. That gap is the whole story. The demo always works; the job is making the agent work the other 43% of the time. Here's why the number is so low, and what the teams above it actually do differently.
- agents
- methodology
June 13, 2026
The webpage can give your agent orders
When you give an AI agent a browser and let it read web pages, click buttons, and run commands, you've handed control of it to every page it visits. Researchers have shown agents hijacked by instructions hidden in website text, in pastebin links, even invisibly inside screenshots the agent looks at. It's called indirect prompt injection, and prompt injection is the number-one risk on OWASP's Top 10 list for LLM apps. The agent can't tell your instructions from the page's. Here's why this is so hard to fix, and how to build so a hostile page can't run your agent.
- security
- agents