EVAL · June 9, 2026

Agents got smarter. They didn't get more reliable.

A new study ran 14 models through reliability tests and found something the benchmark race hides: two years of soaring capability produced only small reliability gains. Smarter isn't steadier. And the math is brutal — even a 95%-reliable step, run 20 times in a row, finishes the whole task correctly about a third of the time. We keep shopping for agents on intelligence when the thing that decides whether they work is something else entirely, something we barely even measure.

There's a question the benchmark leaderboards never answer: not "how smart is this agent," but "can I count on it." A team of researchers just tried to measure that directly, running 14 models through a battery of reliability tests, and the headline finding deserves to puncture some hype. Across roughly two years of rapid capability gains, they found only modest gains in reliability. The models got a lot smarter. They barely got more dependable.

That gap — between how capable an agent is and how much you can rely on it — is, in the words of one analysis, the most important, least discussed issue in enterprise AI right now. And once you see the math underneath it, you stop being surprised that so many impressive agents never make it into production.

Smart and reliable are not the same axis

We've collapsed two different things into one word: "good." A model that scores higher on a reasoning benchmark is more capable. Whether it does the same thing when you run it twice, survives a slightly reworded prompt, fails in a way you can predict, and keeps its mistakes small — that's reliability, and it's a separate axis. The study makes the separation concrete by breaking reliability into four dimensions — consistency, robustness, predictability, and safety — and measuring each. A model can be brilliant on capability and shaky on every one of those.

This is why "the new model scored higher" tells you so little about whether you can build on it. The leaderboard measures the smart axis. Your production incident at 2 a.m. is on the reliable axis. They are not the same number, and the second one is the one that decides whether your agent is a product or a demo.

The compounding math nobody wants on the slide

Here's the part that should change how you design. Agents work in steps — read, plan, call a tool, read the result, act, repeat. And reliability multiplies across steps, which is devastating, because multiplication of numbers below one goes to zero fast.

Run the numbers. If each step is 95% reliable (optimistic for today's models) and steps fail independently of one another, then over 20 steps the chance of getting the whole thing right is 0.95²⁰, which is about 36%. At 85% per step over eight steps, you're down to roughly 27%. The reviews of real deployments match the math: as workflows get longer and more complex, failure rates climb into the 70–90% range. A per-step success rate that sounds great is a whole-task success rate that's a coin flip or worse.
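The arithmetic is easy to check yourself. Here is a minimal sketch, assuming every step has the same success rate and steps fail independently; the function name `end_to_end_success` is made up for illustration and does not come from the study.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds.

    Assumption: all steps share one success rate and fail independently.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# The two cases quoted in the text.
for per_step, steps in [(0.95, 20), (0.85, 8)]:
    rate = end_to_end_success(per_step, steps)
    print(f"{per_step:.0%} per step over {steps} steps -> {rate:.1%} end to end")
# 95% per step over 20 steps -> 35.8% end to end
# 85% per step over 8 steps -> 27.2% end to end
```

Real steps are rarely perfectly independent or equally reliable, so treat the output as a rough guide rather than a prediction.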

Sit with what that means. A "smarter" model that nudges each step from 94% to 96% reliable feels like progress, yet over 20 steps it lifts the end-to-end number only from about 29% to about 44%: the run still fails more often than it finishes. The thing that wrecks a long agent run isn't a lack of intelligence at any single step. It's that small unreliabilities compound, and capability gains don't fix compounding.
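You can also run the compounding backwards and ask how reliable each step has to be before a long run becomes dependable. This is a rough sketch under the same assumptions (identical, independent steps). The 90% end-to-end target is a number chosen here for illustration, and `required_per_step` is a hypothetical helper.

```python
def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed so `steps` independent steps hit `target`.

    Inverts target = per_step ** steps. Assumes identical, independent steps.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return target ** (1.0 / steps)


# Illustrative target: the whole task should succeed 90% of the time.
for steps in (5, 10, 20):
    needed = required_per_step(0.90, steps)
    print(f"{steps} steps -> each step must be {needed:.1%} reliable")
```

Under these assumptions the answers come out near 97.9%, 99.0%, and 99.5%. A 20-step workflow needs nearly flawless steps, which is why trimming steps can pay off faster than waiting for a slightly smarter model.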

What to do about it

If reliability is the binding constraint and capability isn't, your choices change:

Measure reliability, not just capability. Run the same task many times and look at the spread, not the best case. Perturb the input. Check how it fails, not just whether it passed once. A single green run is the least informative thing you can collect — this is the benchmark-isn't-the-job point, made rigorous.
Fight the compounding directly: use fewer steps. Every step you remove multiplies your odds back up. Collapse five model calls into one where you can, replace a reasoning step with a deterministic function, and don't make the agent re-derive what you could just hand it.
Put checkpoints between steps so errors don't propagate. Verify the output of a step before feeding it to the next — ideally against something independent. A caught error at step 3 doesn't compound into a disaster at step 15.
Stop treating "it's smarter" as "it's dependable." When the next model tops the charts, ask the different question: is it more consistent, more predictable, does it fail smaller? If you can't tell, you don't yet know if it's better for an agent — only that it's better at the test.
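Two of these habits, running the same task many times and putting a checkpoint between steps, can be sketched as a small simulation. Everything in it is an assumption made for illustration: `flaky_step` is a stand-in for a real agent step that is right 95% of the time, the check at each checkpoint is treated as always correct, and a caught failure is handled by a single retry, which is only one of several ways to respond to a caught error.

```python
import random
from collections import Counter
from typing import Callable


def tally_outcomes(task: Callable[[], str], runs: int) -> Counter:
    """Run the same task repeatedly and count each distinct outcome."""
    return Counter(task() for _ in range(runs))


def run_workflow(step: Callable[[], bool], n_steps: int, retries: int = 0) -> bool:
    """Run `n_steps` steps in sequence; return True only if all succeed.

    `step()` returns True when its output passes a check. With retries > 0,
    a step that fails the check is rerun before the workflow gives up,
    which models a checkpoint. Assumes the check itself is always right.
    """
    for _ in range(n_steps):
        # any() stops at the first passing attempt.
        if not any(step() for _ in range(retries + 1)):
            return False
    return True


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def flaky_step() -> bool:
    """Hypothetical stand-in for one agent step that is right 95% of the time."""
    return rng.random() < 0.95


def task_without_checkpoints() -> str:
    return "pass" if run_workflow(flaky_step, n_steps=20) else "fail"


def task_with_checkpoints() -> str:
    return "pass" if run_workflow(flaky_step, n_steps=20, retries=1) else "fail"


RUNS = 2000
plain = tally_outcomes(task_without_checkpoints, RUNS)
checked = tally_outcomes(task_with_checkpoints, RUNS)
print(f"no checkpoints:   {plain['pass'] / RUNS:.0%} of runs pass")
print(f"with checkpoints: {checked['pass'] / RUNS:.0%} of runs pass")
```

In expectation the first figure sits near 36% and the second near 95%, because one verified retry turns a 95% step into a 99.75% step. Real verifiers miss things and real retries are not independent, so the gain in practice will be smaller, but the direction of the effect is what matters. Lowering `n_steps` in the same harness shows what removing steps buys.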

The bottom line

The industry sells capability because capability is what the benchmarks measure and what makes a good demo. But the study is a useful splash of cold water: two years of getting smarter bought us only a little more reliable, and reliability, not raw intelligence, is what stands between an agent and production. The compounding math guarantees it. A workflow is only as trustworthy as the product of its steps' reliabilities, and when the steps are alike, that is the per-step rate raised to the power of how many steps there are.

So when you evaluate an agent, resist the leaderboard. The question was never "how smart is it on one hard problem." It's "how often does it do the ordinary thing right, the same way, twenty times in a row." Smarter is easy to sell and easy to measure. Reliable is the one that actually ships — and the one almost nobody is checking.
