Half of AI-Generated Pull Requests Would Be Rejected by Real Maintainers — New METR Research Explains Why

Half of AI-Generated Pull Requests Would Be Rejected by Real Maintainers — New METR Research Explains Why

Last Tuesday, my team lead Marcus forwarded me a research paper with the subject line "read this before our sprint planning." I almost ignored it — Marcus forwards a lot of papers — but the title caught me: "Many SWE-bench-Passing PRs Would Not Be Merged into Main."

I read it during lunch. Then I read it again. Then I called Marcus and said, "We need to talk about how we're evaluating AI coding tools."

Because here's the thing: the AI industry has been using SWE-bench scores like a GPA — the higher the number, the smarter the model. And this paper from METR essentially says that GPA doesn't measure what we thought it measured.

AI-generated pull request code review showing quality issues found by METR research

Photo by Markus Spiske via Pexels

The Study That Should Make Every Engineering Manager Pause

For the uninitiated: SWE-bench is one of the most widely cited benchmarks for evaluating how well AI models can write code. It presents models with real GitHub issues from popular open-source projects and asks them to generate pull requests that fix the issue. If the PR passes the automated test suite, it "passes."

METR — a research organization focused on evaluating AI capabilities — decided to ask a simple but devastating question: would these test-passing PRs actually get merged by the humans who maintain these projects?

They had four active maintainers from three SWE-bench repositories review 296 AI-generated pull requests. The results were... sobering.

Roughly half of the PRs that passed the automated grader would have been rejected by maintainers. Not "needs minor tweaks" rejected. "Please rethink your approach" rejected.

The Numbers That Matter

Let me break this down because the methodology is actually clever.

To account for the fact that human reviewers are subjective (maintainers are people, and people have bad days), METR also had the same maintainers review 47 "golden patches" — the original human-written PRs that were actually merged into main. Those golden patches got a 68% approval rate from the blind review, establishing a baseline for reviewer noise.

After adjusting for that baseline, the gap between "passes automated tests" and "would actually be merged" was about 24 percentage points. That's enormous.

Even more concerning: the rate of improvement over time was 9.6 percentage points per year slower for maintainer merge decisions than for automated test scores. In other words, AI models are getting better at passing tests faster than they're getting better at writing code that humans actually want in their codebase.

Marcus's reaction when I explained this: "So we've been measuring the wrong thing." Pretty much.

Why Did Maintainers Reject These PRs?

This is where it gets interesting. The rejections fell into three buckets:

1. Core functionality failure — The fix technically passed the tests but didn't actually solve the underlying problem correctly. Think of it like fixing a leaky pipe by putting a bucket under it. The floor stays dry (test passes), but the pipe is still leaking.

2. Breaks other code — The PR introduced side effects that the test suite didn't cover. Real codebases have implicit contracts between components that aren't captured in unit tests. AI models, it turns out, are terrible at respecting these invisible boundaries.

3. Code quality issues — The fix worked, didn't break anything, but was written in a way that no self-respecting maintainer would accept. Wrong patterns, ignored project conventions, unnecessary complexity, missing documentation.

My colleague Priya, who maintains a mid-sized open-source project, wasn't surprised. "I've reviewed AI-generated PRs," she told me. "They're like getting homework from a student who read the textbook but never attended class. Technically correct, contextually clueless."

What This Means If You're Using AI Coding Assistants

Before anyone panics: METR explicitly says this isn't about fundamental capability limitations. Their own caveat is important — the AI models only got one shot. Human developers iterate based on code review feedback. Given a chance to respond to "please use our existing utility function instead of reimplementing it," most modern AI models could probably comply.

But that caveat actually reinforces the point. In the real world, AI coding isn't a one-shot operation. It's a conversation. And if your workflow treats AI output as "done" the moment tests pass, you're setting yourself up for exactly the kind of technical debt this study documented.

Here's what I'd recommend based on this research:

Stop using benchmark scores as your primary evaluation metric. A model with a 60% SWE-bench score doesn't resolve 60% of real-world issues. It resolves maybe 30-35% in a way that would actually be acceptable. That's still useful! But calibrate your expectations.

Budget review time into your AI coding workflow. If you've cut code review time because "the AI already tested it," you're making a mistake. AI-generated code needs more review attention, not less, specifically for project conventions and implicit architectural decisions.

Invest in comprehensive test suites. The gap between "passes tests" and "actually correct" exists because test suites are incomplete. This has always been true, but it matters more now that you have a tireless code generator that will exploit every gap in your test coverage.

Treat AI PRs like junior developer PRs. Not because AI is "dumb," but because the failure modes are similar: technically functional code that doesn't understand the project's unwritten rules. The fix is the same too — mentorship through review feedback.

The Uncomfortable Implication

Here's the part nobody in the AI industry wants to talk about: benchmarks are marketing tools as much as they are research tools. When Anthropic or OpenAI or Google announces a new SWE-bench score, that number shows up in press releases, investor decks, and Twitter threads. It shapes buying decisions.

METR's research suggests that number might be inflated by roughly 50% relative to real-world usefulness. That's not fraud — the benchmark measures what it says it measures. But the interpretation of that measurement has been consistently too generous.

Marcus summed it up better than I can: "It's like hiring someone based on their SAT score and then being surprised when they can't navigate office politics."

The AI is getting better. That's not in question. But the gap between "passes automated checks" and "ready for production" is wider than most teams realize. And now we have the data to prove it.

(I told Priya I was writing this piece and she said, "Make sure you mention that the AI-generated PRs also never include a good commit message." Noted, Priya. Noted.)

Related coverage: Debian refused to ban AI-generated code and our best AI code assistants for 2026 review.

If you found this useful, check out these related articles:

Found this helpful?

Subscribe to our newsletter for more in-depth reviews and comparisons delivered to your inbox.