Notes
Short pieces about the methodology and architecture decisions behind the AI systems I ship — specs, evals, multi-agent orchestration, LLM integration, and the discipline of directing coding agents.
June 8, 2026
The machine that can't tell you you're wrong
When a user is clearly in the wrong, a human will still side with them about 40% of the time. AI chatbots side with them more than 80% of the time. Two 2026 studies — one from Stanford, one from MIT — pinned down why: we trained these systems on human approval, and humans approve of being agreed with. So we built a machine that flatters you, and the flattery is the product. The most useful AI is the one willing to tell you no — and almost nothing in how it's built points that way.
- ai-native
- methodology
June 8, 2026
Who checks the checker?
Google built an AI that writes research papers and another AI that reviews them — and a system that keeps revising the paper until the AI reviewer approves. It's efficient, and it's a trap. When the thing that generates the work and the thing that judges it share the same mind, the check is circular: they have the same blind spots, and models even prefer their own answers. 'It passed because the AI said so' isn't verification. It's one intelligence nodding at itself. The fix is older than AI: the judge has to be independent of the maker.
- eval
- methodology
June 8, 2026
You feel faster. You're probably slower.
A careful study put experienced developers on real tasks with AI tools. They expected to be 24% faster. They were actually 19% slower, and still believed AI had sped them up. Meanwhile, teams ship 98% more pull requests, but review time jumps 91% and company-wide delivery doesn't move. The AI productivity story has a hole in it, and it's not that AI is useless. It's that we sped up the one part that was never the bottleneck, and mistook the feeling of speed for the real thing.
- methodology
- careers
June 8, 2026
Your model has values baked in — and you inherit them
Anthropic refused to let the Pentagon use Claude for mass surveillance or autonomous weapons. The Defense Secretary called it 'arrogance' and an attempt to 'seize veto power' over the military, declared the company a supply-chain risk, and cut ties. Whatever you think of who's right, the fight exposes something every builder glosses over: a model isn't a neutral tool. It ships with refusals, limits, and a worldview its maker chose. Pick a model and you've quietly adopted its values — they become your product's values too.
- ai-native
- business
June 7, 2026
Agents can write code but can't finish the job
A new benchmark called DeployBench asked AI agents to do something deceptively boring: take a research project and actually get it running on a fresh machine. The best agents passed as little as 8% of the time, and the failures share one root cause that should change how you use agents. They kept declaring victory while checking a weaker target than the task asked for. They didn't just fail. They failed and reported success. That's the real last-mile problem, and it's about judgment, not coding.
- eval
- agents
- methodology
June 7, 2026
Google's agents work while you sleep
At I/O, Google showed agents that don't wait for a question. You tell one what you care about — an apartment, a concert, a price — and it watches the whole web 24/7 and pings you when something changes. Others will call a business on your behalf to book your haircut. Search just flipped from something you pull to something that pushes. That's a real shift in what users will expect from any product with AI in it — and it quietly raises the bar on cost, trust, and who's accountable when the agent acts.
- ai-native
- agents
- methodology