May 10, 2026
Evals or it didn't ship
Why I refuse to ship an agent without a held-out evaluation set, what makes one useful, and the failure mode I keep seeing when teams skip this.
The most common failure mode I see in AI projects is this: a team ships a demo, the demo works on the inputs the team has been staring at for weeks, and nobody knows whether the system actually got better or worse between June and August.
The fix is unglamorous: a held-out evaluation set, run before every merge, that the agent never sees during development.
What "held out" means in practice
If you wrote the agent and you've also seen the eval inputs, the eval is contaminated. The same goes for prompt tweaks: any prompt I'm iterating on gets tuned against the development split only, never the held-out one. The held-out split is locked in version control, read by CI, and inspected by humans only when the score moves enough to warrant a real investigation. That's it.
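To make that concrete, here's a minimal sketch of the kind of runner I mean, assuming scenarios are stored one JSON object per line and the agent is callable from Python. The file names, the run_agent hook, and the exact-match check are all placeholders for your own setup; the one deliberate choice is that it prints an aggregate score and nothing per-scenario, so nobody reads the held-out inputs by accident.

```python
# run_eval.py (name is a placeholder): run one split, print only the aggregate.
import json
import sys
from pathlib import Path


def run_agent(scenario_input: str) -> str:
    # Placeholder hook: call your agent here and return its raw output.
    raise NotImplementedError


def passes(output: str, scenario: dict) -> bool:
    # Stand-in check: exact match against an "expected" field. Real suites
    # usually score a contract instead (see the next section).
    return output.strip() == scenario.get("expected", "").strip()


def main() -> None:
    # evals/dev.jsonl while iterating; evals/smoke.jsonl on PRs;
    # evals/heldout.jsonl only in the main-branch CI job.
    split = Path(sys.argv[1])
    scenarios = [json.loads(line) for line in split.read_text().splitlines() if line.strip()]
    passed = sum(passes(run_agent(s["input"]), s) for s in scenarios)
    # Aggregate only: per-scenario dumps of held-out inputs defeat the point.
    print(f"{split.name}: {passed}/{len(scenarios)} passed ({passed / len(scenarios):.1%})")


if __name__ == "__main__":
    main()
```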
What makes an eval useful
Three properties, in order:
- It reflects production traffic. A clever benchmark someone else built is useful for cross-system comparison but doesn't tell you what your users actually need. Mine production logs, sample, anonymize, curate. Repeat quarterly.
- It scores something you'd actually act on. "Helpfulness 4.2/5" is not actionable. "78% of scenarios pass the structured output contract" is (a check like the one sketched after this list).
- It's cheap to run. If the eval takes six hours, nobody runs it on a PR. Aim for five minutes for the smoke set, an hour for the full one.
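For the structured output contract, the check itself can be very small. A sketch, with made-up field names ("action", "citations") standing in for whatever your agent actually promises. The point is that each scenario yields a binary pass, and the suite score is just the pass rate:

```python
import json


def meets_contract(raw_output: str) -> bool:
    """True if the agent's output parses as JSON and has the promised shape."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and isinstance(data.get("action"), str)
        and isinstance(data.get("citations"), list)
        and all(isinstance(c, str) for c in data["citations"])
    )


# Suite score: passed = sum(meets_contract(run_agent(s["input"])) for s in scenarios)
```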
The shape of a good eval suite
Two layers:
- A small smoke set (~30 scenarios) that runs on every PR. Catches regressions early.
- A larger held-out set (~300 scenarios) that runs on main, weekly. The set the agent never sees during development.
Public benchmarks belong alongside these, not instead of them — they're useful for talking to other teams, but the contract that matters is the one with your own production traffic.
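Wiring the two layers into CI is mostly a matter of picking the right file per trigger. A minimal dispatcher sketch, assuming the aggregate-only runner above is saved as run_eval.py; the CI_EVENT variable is a stand-in for whatever your CI system exposes (GitHub Actions, for instance, sets GITHUB_EVENT_NAME):

```python
# ci_eval.py (placeholder name): pick the suite based on what triggered the run.
import os
import subprocess
import sys

# Pull requests get the ~30-scenario smoke set; pushes to main and the weekly
# schedule get the ~300-scenario held-out set.
event = os.environ.get("CI_EVENT", "pull_request")
suite = "evals/smoke.jsonl" if event == "pull_request" else "evals/heldout.jsonl"

sys.exit(subprocess.call([sys.executable, "run_eval.py", suite]))
```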
What I've stopped accepting
"It looked good in testing" is not a state I let agents enter production in. If the eval doesn't exist, the agent doesn't ship. If the eval score regresses, the merge doesn't happen until we understand why.
This sounds harsh until you watch a system silently regress over six sprints because nobody was measuring. Then it sounds obvious.
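Mechanically, the merge gate is just a comparison against a committed baseline. A sketch, assuming the runner writes its pass rate to evals/latest.json and the last accepted score lives in evals/baseline.json (both file names and the tolerance are placeholders):

```python
# gate_eval.py (placeholder name): block the merge on a real regression.
import json
import sys
from pathlib import Path

TOLERANCE = 0.02  # absorb small run-to-run noise; anything bigger needs a human

baseline = json.loads(Path("evals/baseline.json").read_text())["pass_rate"]
latest = json.loads(Path("evals/latest.json").read_text())["pass_rate"]

if latest < baseline - TOLERANCE:
    print(f"eval regressed: {baseline:.1%} -> {latest:.1%}; blocking merge")
    sys.exit(1)

print(f"eval ok: {baseline:.1%} -> {latest:.1%}")
```

One sensible convention is to update the baseline file in a commit of its own, after someone has understood why the score moved.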