METHODOLOGY · June 10, 2026

Why your agent's pull request gets rejected

Researchers studied 33,000 pull requests written by AI coding agents, and about 29% never got merged. The interesting part is why: not mostly because the code was wrong, but because the PR was a bad collaboration artifact — too big, touching too many files, bundling unrelated changes, failing CI, and explaining itself poorly. Getting code accepted turns out to be a different skill than writing it, and it's exactly the skill agents don't have by default. Here's what that means for using them.

A team of researchers pulled 33,000 pull requests authored by AI coding agents and asked a simple question: which ones get merged, and which ones die? The headline number is that about 29% of agent PRs fail to merge. But the useful finding is the pattern behind it — because it isn't the one most people assume.

The reflex is to think the failed PRs contained bad code. Sometimes they did. But the study found the failures cluster around something else: rejected PRs tend to involve larger code changes, touch more files, and fail the project's CI. Related work finds that functionally correct code gets rejected too, for being too big, bundling unrelated edits, or explaining itself poorly. In other words, the agent often solved the problem and still got turned away, because the PR was a bad way to ask for the change.

Writing the code was never the hard part of merging it

Here's the reframe. A merge isn't a code event; it's a social one. A human maintainer has to read the change, understand it, trust it, and take responsibility for it. Everything that makes that easy — a small, focused diff; one coherent change; passing tests; a description that explains what and why — is what gets a PR accepted. An agent that produces correct code but a sprawling, unexplained, CI-failing PR has done the easy half and skipped the half that actually decides the outcome.

And this isn't an AI-specific rule. Maintainers reject big, scope-creeping, unexplained PRs from humans too. The research just shows agents do it more, faster, and at scale, because an agent optimizes for "produce a working solution," not "produce a change a busy human will accept." Those are different goals, and the second one is the one the merge button cares about.

The tells the study found

A few of the patterns are specific enough to act on:

Size kills. The single clearest signal of a doomed PR is that it's big and touches a lot of files. Agents love to refactor adjacent things while they're in there; reviewers hate it.
CI is a gate, not a suggestion. Failed PRs fail the build more often. An agent that opens a PR without running the tests first is sending the reviewer a signal that says "I didn't check."
Task type predicts success. Documentation, CI, and build-config changes merge the best; performance work and bug fixes merge the worst. The merge rate tracks how verifiable and contained the change is — exactly the thing complex fixes lack.
The explanation is part of the work. PRs with vague or misaligned descriptions are less likely to be accepted and take longer. The written description is as much the artifact as the code, and that becomes visible at review time. If the reviewer can't quickly see what changed and why, they can't cheaply trust it, so they don't.
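The size signal is the easiest of these to check mechanically. Below is a minimal sketch, assuming the agent can hand over the output of `git diff --numstat` before it opens a PR. The thresholds `MAX_FILES` and `MAX_LINES` are hypothetical: the study says bigger PRs fare worse, not where the cutoff lies.

```python
from dataclasses import dataclass

# Hypothetical limits, not taken from the study. Tune them per repository.
MAX_FILES = 10
MAX_LINES = 400


@dataclass
class DiffSize:
    files: int
    lines: int


def parse_numstat(numstat: str) -> DiffSize:
    """Summarize the output of `git diff --numstat`.

    Each row holds three tab-separated fields: added, deleted, path.
    Binary files report "-" for both counts, so they add a file but no lines.
    """
    files = 0
    lines = 0
    for row in numstat.splitlines():
        parts = row.split("\t", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        added, deleted, _path = parts
        files += 1
        if added.isdigit():
            lines += int(added)
        if deleted.isdigit():
            lines += int(deleted)
    return DiffSize(files=files, lines=lines)


def size_warnings(size: DiffSize) -> list[str]:
    """Return one human-readable warning per exceeded limit."""
    warnings = []
    if size.files > MAX_FILES:
        warnings.append(f"touches {size.files} files (limit {MAX_FILES})")
    if size.lines > MAX_LINES:
        warnings.append(f"changes {size.lines} lines (limit {MAX_LINES})")
    return warnings
```

A non-empty result from `size_warnings` is a prompt to split the change, not proof that it is wrong.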

There's also a quieter finding worth sitting with: most agent PRs get little real human review at all, and what review they do get increasingly comes from other agents. As agent PRs flood in by the millions, human review attention is the scarce resource, and a PR that's expensive to review loses to one that's cheap, regardless of whose code is better.

What to do with this

If you're pointing coding agents at real repositories, the lessons are concrete:

Constrain the agent to small, single-purpose PRs. One change, minimal files, no opportunistic refactors. The logic is that reliability compounds: every extra file or unrelated edit is one more thing that can go wrong, and a small change is one a human can actually verify.
Make it pass CI before it opens the PR, not after. Running the tests is part of producing the change, not a separate step someone else does.
Make it write the description a reviewer needs. What changed, why, what to check. Treat the explanation as a deliverable, because at review time it is one.
Aim it at what merges, supervise it on what doesn't. Let it run looser on docs, config, and small contained fixes; keep a tight hand on performance work and gnarly bug fixes, where correctness is hard to see and trust is hard to earn.
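The last three items can be turned into a gate the agent must clear before it opens anything. What follows is a rough sketch under stated assumptions, not a prescribed workflow: the `Draft` fields, the section names, and the task-type labels are all hypothetical, and `tests_passed` is assumed to come from whatever test runner your project already uses.

```python
import re
from dataclasses import dataclass

# Hypothetical headings, borrowed from the "what changed, why, what to check"
# wording above. Rename them to match your own PR template.
REQUIRED_SECTIONS = ("What changed", "Why", "What to check")

# Task types the study reports merging best. The label strings are assumed.
LOW_RISK = {"docs", "ci", "build"}


@dataclass
class Draft:
    task_type: str
    description: str
    tests_passed: bool


def missing_sections(description: str) -> list[str]:
    """Return required headings that do not start any line of the description."""
    missing = []
    for name in REQUIRED_SECTIONS:
        # Allow leading markup such as "## " or "**" before the heading text.
        pattern = rf"^\W*{re.escape(name)}\b"
        if not re.search(pattern, description, re.IGNORECASE | re.MULTILINE):
            missing.append(name)
    return missing


def review_plan(draft: Draft) -> tuple[bool, list[str]]:
    """Say whether the draft may be opened, and list the reasons if not."""
    problems = []
    if not draft.tests_passed:
        problems.append("tests have not passed locally")
    for name in missing_sections(draft.description):
        problems.append(f"description lacks a '{name}' section")
    return (not problems, problems)


def supervision(task_type: str) -> str:
    """Pick a supervision level: light for low-risk types, close otherwise."""
    # Performance work, bug fixes, and unknown types all get close supervision.
    return "light" if task_type in LOW_RISK else "close"
```

Treating unknown task types as needing close supervision is a deliberate default: it is cheaper to loosen a rule later than to discover that a risky change went through unreviewed.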

The bottom line

The study is a useful mirror. We keep measuring coding agents on whether they can write code that works, and they increasingly can. But "works" was never what gets merged — "small, tested, explained, and easy to trust" is. The 29% that fail aren't mostly failing on correctness; they're failing on collaboration, which is a skill the model wasn't optimizing for and you have to supply.

So when your agent's PR gets bounced, don't just ask "was the code right." Ask the question the reviewer was actually asking: was this a change a busy human could quickly understand, verify, and take responsibility for? Make the agent produce that, and the merge rate takes care of itself. Skip it, and you'll keep shipping correct code that nobody accepts — which, on a team, is the same as not shipping at all.
