There’s a new study from METR that I can’t stop thinking about. They took hundreds of AI-generated pull requests that passed SWE-bench Verified — the gold standard benchmark for AI coding agents — and showed them to actual maintainers of the real repositories. The result: roughly half of those PRs would not have been merged.
Let me sit with that for a moment. Fifty percent pass rate on the benchmark. Twenty-four percentage points lower in the real world. That’s not a rounding error. That’s a chasm.









