Notes
Short pieces about the methodology and architecture decisions behind the AI systems I ship — specs, evals, multi-agent orchestration, LLM integration, and the discipline of directing coding agents.
June 9, 2026
Agents got smarter. They didn't get more reliable.
A new study ran 14 models through reliability tests and found something the benchmark race hides: two years of soaring capability produced only small reliability gains. Smarter isn't steadier. And the math is brutal: if every step is 95% reliable and the steps fail independently, a 20-step task finishes correctly only about 36% of the time (0.95^20 ≈ 0.36). We keep shopping for agents on intelligence when the thing that decides whether they work is something else entirely, something we barely even measure.
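The compounding arithmetic above is easy to check. Here is a minimal sketch, assuming each step succeeds independently with the same probability (real agent steps are rarely that tidy). The function name `chain_success` is made up for illustration and is not from the study.

```python
def chain_success(step_reliability: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps are independent and equally reliable, which is a
    simplification: real agent failures tend to be correlated.
    """
    if not 0.0 <= step_reliability <= 1.0:
        raise ValueError("step_reliability must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_reliability ** steps


if __name__ == "__main__":
    # How fast a 95%-reliable step decays as the chain gets longer.
    for n in (1, 5, 10, 20, 50):
        print(f"{n:>3} steps at 95% each: {chain_success(0.95, n):.1%} end to end")
```

At 20 steps this comes out to roughly 36%, and at 50 steps it falls below 10%, which is why per-step reliability matters more the longer an agent runs.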
- eval
- agents
June 8, 2026
AI is brilliant at ideas and bad at being right
We worried AI would automate the boring work and leave humans the creative heights. The research from 2026 says we had it backwards. When AI agents were set loose on real research, they generated novel, clearly-written ideas — and then fabricated or invalidated their own experimental results in about 80% of cases. AI turns out to be a fantastic source of ideas and a terrible judge of whether they're true. Once you see that split, how you should use it becomes obvious — and so does the mistake almost everyone is making.
- methodology
- eval
June 8, 2026
Who checks the checker?
Google built an AI that writes research papers and another AI that reviews them — and a system that keeps revising the paper until the AI reviewer approves. It's efficient, and it's a trap. When the thing that generates the work and the thing that judges it share the same mind, the check is circular: they have the same blind spots, and models even prefer their own answers. 'It passed because the AI said so' isn't verification. It's one intelligence nodding at itself. The fix is older than AI: the judge has to be independent of the maker.
- eval
- methodology
June 7, 2026
Agents can write code but can't finish the job
A new benchmark called DeployBench asked AI agents to do something deceptively boring: take a research project and actually get it running on a fresh machine. The best agents passed as little as 8% of the time — and the failures share one root cause that should change how you use them. The agents kept declaring victory while checking a weaker target than the task asked for. They didn't just fail. They failed and reported success. That's the real last-mile problem, and it's about judgment, not coding.
- eval
- agents
- methodology
June 7, 2026
For long-running agents, cost-per-task is the only benchmark
NVIDIA's new Nemotron 3 Ultra isn't pitched on being the smartest model. It's pitched on being cheap to run for hours — built for agents that plan, call tools, and reason across hundreds of turns. That framing is the real story. When an agent runs long, the number that matters stops being the benchmark score or the per-token price and becomes dollars-per-finished-task. Two models at the same token price can differ 2x on a real job. Here's why the leaderboard is the wrong thing to shop on once your agent runs for more than a moment.
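One way to see how two models at the same token price can land 2x apart is to write the metric down. The sketch below assumes a failed attempt is simply retried until one succeeds, so the expected number of attempts is 1 / success rate. Every name and number in it is hypothetical and chosen for illustration; none of it is measured data for Nemotron or any other model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    """Hypothetical per-model numbers for one kind of long-running task."""
    name: str
    price_per_million_tokens: float  # USD, illustrative
    tokens_per_attempt: int          # total tokens burned in one attempt
    success_rate: float              # fraction of attempts that finish the task


def cost_per_finished_task(profile: AgentProfile) -> float:
    """Expected dollars spent per successfully finished task.

    Assumes independent retries until success, so the expected number
    of attempts is 1 / success_rate.
    """
    if not 0.0 < profile.success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = (
        profile.tokens_per_attempt / 1_000_000 * profile.price_per_million_tokens
    )
    return cost_per_attempt / profile.success_rate


if __name__ == "__main__":
    # Same token price; different verbosity and finish rate (made-up values).
    model_a = AgentProfile("model-a", 2.0, 1_500_000, 0.75)
    model_b = AgentProfile("model-b", 2.0, 2_000_000, 0.50)
    for m in (model_a, model_b):
        print(f"{m.name}: ${cost_per_finished_task(m):.2f} per finished task")
```

With these made-up inputs the two models share a price sheet, yet one costs twice as much per finished task, because it uses more tokens per attempt and finishes less often.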
- ai-native
- business
- eval
June 4, 2026
87% on the benchmark, and it still can't evolve your codebase
The headline says AI 'solves 87% of SWE-bench,' and everyone reads it as 'AI can do software engineering now.' Two problems. The small one: a third of those passes leaked the answer or had weak tests. The fatal one: the benchmark measures one isolated bug fix, not the actual job — evolving a living codebase over weeks. Measure that, and the same models fall from ~73% to ~25%. The benchmark is the demo. Your codebase is production.
- eval
- agents
- methodology