Notes
Short pieces about the methodology and architecture decisions behind the AI systems I ship — specs, evals, multi-agent orchestration, LLM integration, and the discipline of directing coding agents.
June 5, 2026
Agents are arriving where a mistake is a lawsuit
This week Experian shipped an 'Agent OS' for lending — agents that decide credit, flag fraud, determine who's eligible. These are the rooms where a hallucination isn't an awkward chatbot reply; it's a denied loan, a wrong medical authorization, a court date. And one number sets the stakes: AI healthcare denials are overturned 80%+ of the time on appeal — but fewer than 1% of people appeal. Here's why regulated domains are where the whole agent argument becomes law.
- architecture
- business
- agents
June 5, 2026
Microsoft sent 100 agents to hunt bugs — AI vs AI security, honestly
This week Microsoft showed a security team made of AI: a pipeline of 100+ agents that found 16 new Windows vulnerabilities, four of them critical, plus the first AI to auto-convict malware. The defenders now run autonomous AI. So do the attackers: one attacker's AI ran 80–90% of a real intrusion on its own. 'AI vs AI security' stopped being a slogan this spring. Here's the honest read: it's real progress, and a faster stalemate.
- security
- agents
June 5, 2026
The agent that "closes sales" — the part the demo hides
Meta just shipped an agent that doesn't only chat — it books appointments, qualifies leads, closes sales, and takes payments, 24/7, in any language, wired into Shopify and Zendesk. A million businesses are already on it. The demo is magic. What it hides: an autonomous thing acting on your business, at machine speed, on messages from strangers — and the law just closed the 'the AI did it' escape hatch. Here's the honest version.
- security
- business
- agents
June 5, 2026
"Which part do we agentize first?" is the wrong first question
The whole market has moved from 'are agents real?' to 'which part of my company gets agentized first?' — IT support, sales, reconciliations. It feels like the smart strategic question. It's the wrong one. Asking where to point the agent skips the two questions that actually decide whether any of it works: what does the agent stand on, and who answers when it's wrong. Here's the order that matters.
- methodology
- business
- agents
June 4, 2026
87% on the benchmark, and it still can't evolve your codebase
The headline says AI 'solves 87% of SWE-bench,' and everyone reads it as 'AI can do software engineering now.' Two problems. The small one: a third of those passes leaked the answer or had weak tests. The fatal one: the benchmark measures one isolated bug fix, not the actual job — evolving a living codebase over weeks. Measure that, and the same models fall from ~73% to ~25%. The benchmark is the demo. Your codebase is production.
- eval
- agents
- methodology
June 4, 2026
The labs are racing on price now, not IQ
For two years a flagship model reveal had one headline: we're the smartest, here's the benchmark we beat. At Microsoft Build 2026 the headline changed — same league as Opus, but ~10x more output per dollar and 60% fewer tokens. The boast moved from IQ to efficiency, and the whole industry is reorganizing around price, not peak capability. Here's why the axis flipped, and what it means if you build.
- ai-native
- business
- agents