METHODOLOGY · June 13, 2026

A green checkmark can hide a broken middle

Here's the failure mode that eats AI agents in production: an agent runs a multi-step task, makes a wrong turn somewhere in the middle, and still produces a final answer that passes your check. The output looks clean. The reasoning was broken. Researchers found this is exactly how multi-step agents fail — a step-three mistake propagates invisibly into a step-ten summary that reads fine and is wrong. If you only grade the final answer, you're blind to most of how agents actually break. Here's why, and what to check instead.

Here's the most dangerous way an AI agent fails, because it's the one you won't see. The agent runs a multi-step task. Somewhere in the middle it takes a wrong turn. And it still hands you a final answer that passes your check — looks clean, well-formatted, plausible, and wrong. The checkmark is green. The middle is broken. You shipped it.

This isn't a rare edge case; researchers studying agent reliability describe it as a core failure mode. In a multi-step task, an intermediate mistake can pass a final-output check while corrupting the whole workflow. Their example is sharp: a research agent correctly retrieves a competitor's information, misattributes one product feature to the wrong company in step three, and produces a final summary that passes a surface-level check while the factual error rides along invisibly.

I want to sit on this one, because it's the gap between "my agent passed its tests" and "my agent is reliable," and those are not the same thing.

Why the final answer lies

When you test normal software, checking the output is usually enough — deterministic code that produces the right answer got there the right way. Agents break that assumption. They're non-deterministic, they reason in long chains, and a chain has many ways to reach a plausible-looking endpoint while being wrong on the way.

So a passing final answer tells you less than it does for ordinary code. The agent can get the format right and the facts wrong. It can reach a reasonable-sounding conclusion from a botched intermediate step, the way a student can land on the right-looking answer through canceling errors. Worse, confident, fluent output is exactly where a model is most dangerous when it's wrong — the polish that makes the answer pass your check is the same polish that hides the broken reasoning underneath it.
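
To make that concrete, here is a minimal sketch of a surface-level check approving a wrong answer. Everything in it is hypothetical: the company names, the features, and the `surface_check` and `fact_check` functions are invented for illustration, and a real final-output check would look different.

```python
import re

# Hypothetical retrieved facts: which company ships which feature.
SOURCE = {"Acme": {"offline mode"}, "Globex": {"SSO"}}

# Hypothetical final summary: well-formed, but the features are swapped.
summary = "Acme: offers SSO. Globex: offers offline mode."


def sentences(text: str) -> list[str]:
    """Split on periods and drop empty fragments."""
    return [part.strip() for part in text.split(".") if part.strip()]


def surface_check(text: str) -> bool:
    """Format-only check: each sentence looks like '<Company>: offers <feature>'."""
    parts = sentences(text)
    return bool(parts) and all(
        re.fullmatch(r"[A-Z]\w+: offers [\w ]+", s) for s in parts
    )


def fact_check(text: str) -> bool:
    """Content check: each attributed feature must appear in the source for that company."""
    for s in sentences(text):
        company, feature = s.split(": offers ")
        if feature not in SOURCE.get(company, set()):
            return False
    return True


print(surface_check(summary))  # True: the format is fine
print(fact_check(summary))     # False: the attribution is wrong
```

The same string gets a green checkmark from one function and a red one from the other; which one you run decides what you ship.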

Why this is the expensive bug

A visible failure is cheap — the agent errors out, you see it, you fix it. This one is expensive precisely because it looks like success. The summary gets sent to the client. The number flows into the report. The misattributed feature becomes a fact your team repeats. By the time anyone notices, the error has propagated through everything downstream that trusted the green checkmark.

And it compounds with the reliability math: when each step depends on the one before it, per-step success rates multiply, so small error rates add up fast. A 2026 coding workflow averages around twenty dependent steps, and a final-output check only looks at the last one. Nineteen places to take a wrong turn, one place you're looking. That's how agents post good demo numbers and then quietly disappoint in production: the demo grades the answer, production lives with the reasoning.
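
As a rough illustration of that math, assume (and this is an assumption, not a measured figure) that each step succeeds independently with the same probability. The chance that all twenty steps are right is then that probability raised to the twentieth power. The 95% and 99% per-step figures below are hypothetical.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step is correct, assuming independent,
    equally reliable steps (a simplifying assumption)."""
    return per_step ** steps


for per_step in (0.99, 0.95):
    print(f"{per_step:.2f} per step -> {chain_success(per_step, 20):.2f} over 20 steps")
# 0.99 per step -> 0.82 over 20 steps
# 0.95 per step -> 0.36 over 20 steps
```

Under those assumptions, an agent that is right 95% of the time per step gets all twenty steps right only about a third of the time. Real steps are neither independent nor equally reliable, so treat this as intuition rather than a prediction.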

What to check instead

The fix is to stop grading only the destination and start grading the journey:

Eval the steps, not just the output. Evals or it didn't ship — and for agents that means checking intermediate reasoning, tool calls, and retrievals, not just the final string.
Make the agent show its work. An agent that exposes its intermediate reasoning and sources lets you — or another model — catch the step-three slip before it reaches step ten. A black box that only emits a final answer gives you nothing to inspect.
Verify the facts against the source. For retrieval and research tasks, check that each claim traces back to what was actually retrieved. The misattribution survives a style check; it dies against the source.
Put a checkpoint before anything irreversible. If a step sends, pays, deletes, or commits, that's where a human or a hard validation belongs — not at the very end, after the broken middle already acted.
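
The four checks above can be sketched as a single pass over an agent's trace. This is a minimal illustration under stated assumptions, not a definitive implementation: the `Step` record, the `grade_trace` function, and the `approve` callback are hypothetical names, and the substring match stands in for whatever claim verification you actually use (it is far too naive for real text).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    """One hypothetical entry in an agent trace."""
    name: str
    claims: list[str] = field(default_factory=list)   # facts this step asserts
    sources: list[str] = field(default_factory=list)  # text this step retrieved
    irreversible: bool = False                         # sends, pays, deletes, commits


def grade_trace(trace: list[Step], approve: Callable[[Step], bool]) -> list[str]:
    """Check every step, not just the last one. Returns a list of problems."""
    problems: list[str] = []
    evidence: list[str] = []
    for i, step in enumerate(trace, start=1):
        evidence.extend(step.sources)
        # Verify each claim against what was actually retrieved so far.
        # Naive stand-in: a claim counts as grounded if it appears verbatim.
        for claim in step.claims:
            if not any(claim.lower() in e.lower() for e in evidence):
                problems.append(f"step {i} ({step.name}): unsupported claim: {claim!r}")
        # Checkpoint before anything irreversible, not after it.
        if step.irreversible and (problems or not approve(step)):
            problems.append(f"step {i} ({step.name}): blocked before irreversible action")
            break
    return problems


trace = [
    Step("retrieve", sources=["Acme offers offline mode.", "Globex offers SSO."]),
    Step("compare", claims=["Acme offers SSO"]),  # the mid-task slip
    Step("send_email", irreversible=True),
]

problems = grade_trace(trace, approve=lambda step: True)
for problem in problems:
    print(problem)
# step 2 (compare): unsupported claim: 'Acme offers SSO'
# step 3 (send_email): blocked before irreversible action
```

The design choice worth noting is that the gate sits inside the loop: the email step is stopped because an earlier step failed, even though the hypothetical approver would have said yes.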

This is more work than reading the final answer. That's the point: the final answer was always the cheap thing to check, and cheap checks are why broken middles ship.

The bottom line

A green checkmark on the final output feels like proof the agent worked. For multi-step agents, it's weaker evidence than it looks — the output can be clean while the reasoning that produced it was wrong, and that exact gap is one of the main ways agents fail in production. Grade only the destination and you're blind to most of the journey, which is where the failures live.

So when you evaluate an agent, distrust the clean final answer a little. Ask how it got there, check the steps that matter, and verify the facts against their source. The reasoning is the product; the answer is just where it surfaces. A broken middle with a green checkmark on top is still broken — and the whole job is catching it before your users do.
