EVAL · June 4, 2026

87% on the benchmark, and it still can't evolve your codebase

The headline says AI 'solves 87% of SWE-bench,' and everyone reads it as 'AI can do software engineering now.' Two problems. The small one: in an audit, about a third of passing patches had the answer leaked to them, and roughly another third got through on weak tests. The fatal one: the benchmark measures one isolated bug fix, not the actual job of evolving a living codebase over weeks. Measure that, and the same models fall from ~73% to ~25%. The benchmark is the demo. Your codebase is production.

You've seen the headline: AI coding agents now "solve" 80–90% of SWE-bench, the standard benchmark for fixing real GitHub issues. In 2023 the number was around 4%. The progress is genuinely staggering, and the natural reading is "AI can do software engineering now."

That reading is wrong, in a small way and then in a fatal way, and the gap between the score and the job is one of the most important things to understand about where coding agents actually are.

The small problem: the benchmark leaks

Start with the unglamorous caveat. SWE-bench isn't as clean as the number suggests. A close audit found that about a third of "successful" patches had the solution leaked to them, stated outright in the issue report or its comments, and roughly another third passed because the test cases were too weak to catch a wrong fix. On top of that, many of the GitHub issues were filed and fixed before the models' training cutoffs, so a model may have simply seen the answer during training. When researchers built a contamination-resistant version, SWE-Bench Pro, scores collapsed below 25%, with GPT-5 on top at 23.3%. So part of that impressive 87% is leaked answers and memory, not skill.
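To get a rough feel for what those audit fractions would do to a headline number, here is a back-of-envelope sketch. It assumes, purely for illustration, that the leaked and weak-test passes do not overlap and that the audit's fractions carry over unchanged to an 87% run. Neither assumption comes from the audit, and the function name is made up for this example.

```python
def discounted_pass_rate(headline: float, leaked: float, weak_tests: float) -> float:
    """Share of tasks still counted as solved after discarding suspect passes.

    Assumes (illustration only) that leaked and weak-test passes do not overlap.
    """
    suspect = leaked + weak_tests
    if not 0.0 <= suspect <= 1.0:
        raise ValueError("suspect fractions must sum to between 0 and 1")
    return headline * (1.0 - suspect)


# Headline from the article: 87%. Audit fractions: about a third each.
remaining = discounted_pass_rate(0.87, leaked=1 / 3, weak_tests=1 / 3)
print(f"{remaining:.0%}")  # prints "29%"
```

Under those assumptions, less than a third of the tasks remain as clean passes. Treat the figure as an order-of-magnitude illustration, not a corrected score.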

That's worth knowing, but it's not the real story. Even a perfectly clean benchmark would mislead you here, because of what it measures.

The fatal problem: the benchmark isn't the job

SWE-bench gives an agent one isolated GitHub issue, with a known fix, and a test that confirms when it's solved. Think about how little that resembles your actual work. Real software engineering is not a stream of self-contained puzzles with the answer key nearby. It's evolving a living codebase over weeks — interpreting a vague requirement, coordinating a change across dozens of files, preserving everything that already worked, and arguing with a reviewer about tradeoffs along the way.

The benchmark's own authors are clear about this. SWE-bench measures patch-level correctness in an isolated, single-issue setting; it does not measure an agent's ability to maintain a coherent multi-week development thread, coordinate with human reviewers, manage competing product priorities, or reason about the business implications of a technical decision. Read that list again — it's most of what the job actually is.
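The single-issue, patch-level check described above can be sketched in a few lines. This is a simplified illustration, not the SWE-bench harness: the function and the test names are hypothetical, and the sketch is modeled loosely on the two test lists a SWE-bench task carries, the tests the fix should turn green and the tests that must stay green.

```python
def is_resolved(passed: set[str], fail_to_pass: set[str], pass_to_pass: set[str]) -> bool:
    """A patch counts as resolving the issue only if every target test now
    passes and no previously passing test has regressed."""
    return fail_to_pass <= passed and pass_to_pass <= passed


# Hypothetical task: one test reproduces the bug, two guard existing behavior.
fail_to_pass = {"test_issue_repro"}
pass_to_pass = {"test_parse_basic", "test_parse_unicode"}

# Tests that pass after applying each (hypothetical) patch.
good_patch = {"test_issue_repro", "test_parse_basic", "test_parse_unicode"}
regressing_patch = {"test_issue_repro", "test_parse_basic"}

print(is_resolved(good_patch, fail_to_pass, pass_to_pass))        # True
print(is_resolved(regressing_patch, fail_to_pass, pass_to_pass))  # False
```

Nothing in this check looks at review feedback, product priorities, or what happens to the codebase next week. It also only sees what the listed tests cover, which is how a wrong fix can slip through a weak test suite.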

What happens when you measure the real thing

In late 2025, researchers built a benchmark for exactly the missing part: SWE-EVO, which tests long-horizon software evolution — multi-step changes that span, on average, 21 files and are checked against ~874 tests per task. The result is brutal and clarifying. A model configuration that scores about 73% on SWE-bench Verified scores only ~25% on SWE-EVO. Same models, same intelligence — the score doesn't dip, it falls off a cliff, because coordinating sustained change across many files is a fundamentally different and harder skill than patching one file in isolation. It's the same wall I wrote about in "one agent that does everything": hold enough of a real system in context at once and the model starts to drown.
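One simple way to build intuition for why scores can drop so steeply on multi-file work is a toy compounding model. It is an assumption made for illustration, not SWE-EVO's methodology or its explanation: treat each of the 21 files as an independent edit that must be right, with the same per-file success rate.

```python
def task_success(per_file: float, files: int) -> float:
    """Chance that every one of `files` independent edits is correct."""
    return per_file ** files


def implied_per_file(task_rate: float, files: int) -> float:
    """Per-file success rate that would produce `task_rate` over `files` edits."""
    return task_rate ** (1.0 / files)


FILES = 21  # average files touched per SWE-EVO task, from the article

# Per-file reliability that would yield a ~25% task-level score in this toy model.
print(f"{implied_per_file(0.25, FILES):.3f}")  # prints "0.936"
# A 73% per-edit success rate, compounded over 21 files.
print(f"{task_success(0.73, FILES):.4f}")      # prints "0.0013"
```

In this toy model, even an agent that gets each file right about 94% of the time finishes only a quarter of its 21-file tasks. Real edits are not independent, so the numbers are for intuition only.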

You already know this pattern

A benchmark is a curated task with the answer reachable. Your codebase is a moving target with no answer key. Reading a high benchmark score as "can do the job" is the exact same mistake as believing a polished demo — and I've made that argument before: the demo proves the agent can succeed once, under conditions someone chose for it. Production — and a real codebase is production — asks whether it can keep succeeding on work nobody curated. SWE-bench is the demo. SWE-EVO is a glimpse of the job.

The honest read

This is not "AI coding is fake." Going from 4% to genuinely strong on isolated issue resolution is real, and an agent that reliably fixes a self-contained bug is a real lever I use every day. The mistake is reading the leaderboard as a measure of "can replace a software engineer." Because what makes someone a software engineer is precisely the long-horizon part the benchmark leaves out: carrying intent across weeks, coordinating change through a whole system, and not breaking the twenty files nobody asked you about. That's the work. The benchmark measures the warm-up.

So when the next "AI hits 90% on SWE-bench" headline lands, ask the only question that matters: ninety percent of what? One curated issue with the answer nearby is not your Tuesday. Until a benchmark can measure evolving a real codebase over weeks without breaking it, the score is measuring the demo — and you already know the demo was never the job.
