SWE-bench Verified: Half of AI Code Fails Human Review


Key Takeaways

1. A study by METR finds that about half of the AI-generated code that passes SWE-bench Verified would be rejected by human maintainers.

2. Automated tests miss functional errors and therefore overestimate the quality of AI code.

3. The Claude and GPT-5 models perform below expectations when judged against real-world project standards.

Why it matters: These results call into question the reliability of AI benchmarks, influencing the decisions of companies and technology investors.

A Critical Study on AI Agent Performance

The research organization METR has recently published a study that questions the effectiveness of coding benchmarks used to evaluate the performance of artificial intelligence agents. Focusing on the widely recognized SWE-bench Verified benchmark, METR found that this test may exaggerate the actual capabilities of AI agents in software development.

To conduct this study, METR enlisted four experienced developers in the open-source field. These experts examined 296 code contributions generated by AI agents. Their findings are striking: approximately half of the solutions that passed automated tests were deemed unacceptable for integration into real software projects.

Many of the rejections stem not from simple stylistic issues, but from fundamental functional errors. AI agents often fail to address the underlying problem, even when they manage to pass the suite of automated tests. This raises questions about the true effectiveness of these tools in real development environments.
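A toy illustration of how a patch can pass a test suite without addressing the underlying problem. Everything here is hypothetical and not taken from the study: the issue asks for duplicates to be removed while keeping the original order, and the only test happens to use input that is already sorted.

```python
def dedupe_patch(items):
    # Hypothetical agent patch: removes duplicates but silently sorts the result.
    return sorted(set(items))


def dedupe_reference(items):
    # What the issue actually asks for: drop duplicates, keep first-seen order.
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def benchmark_test(fn):
    # The only check in this hypothetical test suite; its input is already sorted.
    return fn([1, 1, 2]) == [1, 2]


print(benchmark_test(dedupe_patch))      # the flawed patch passes the test
print(dedupe_patch([3, 1, 3]))           # order is lost on unsorted input
print(dedupe_reference([3, 1, 3]))       # order is preserved
```

A reviewer reading the patch would spot the lost ordering at once, while the automated check reports success.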

The Limitations of Automated Testing

METR's study highlights a significant flaw in the SWE-bench Verified benchmark. While it is used to assess whether AI agents can solve real programming problems, it appears to overestimate the capabilities of these agents. Indeed, about half of the solutions marked as "accepted" by the benchmark would be rejected by human developers in concrete project contexts.

SWE-bench Verified is a reference tool for many companies, including Anthropic and OpenAI, which rely on its results to demonstrate the advancements of their models. However, the study led by Parker Whitfill, Cheryl Wu, Joel Becker, and Nate Rush raises serious doubts about the validity of these claims.

The research team had a total of 296 AI-generated code contributions reviewed by four experienced developers who actively maintain three SWE-bench projects: scikit-learn, Sphinx, and pytest. Approximately half of the solutions that passed the automated tests would never be integrated into the actual codebase by the project maintainers.

A Stricter Human Evaluation

The study involved a thorough analysis of solutions generated by five AI models: Claude 3.5 Sonnet (Old), Claude 3.7 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, and GPT-5. The tests were conducted using the Benchmarking Hub from Epoch AI. The maintainers reviewed the solutions blind, without knowing whether a given solution came from a human or an AI.

To assess the reliability of human judgments, the researchers also asked the maintainers to evaluate human solutions that had already been integrated into the projects. Only 68% of these solutions were reapproved, indicating that even human contributions are not always perfect. On average, the adoption rate of AI solutions was 24 percentage points lower than the score given by SWE-bench.

The results were normalized against this 68% human reference, and the difference is statistically significant. The rate of improvement over time is also about 9.6 percentage points lower per year when measured by human evaluations, although the researchers themselves note that this result is statistically weaker.
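The idea of normalizing against the 68% human baseline can be sketched with a small calculation. This is a minimal sketch that assumes normalization means dividing a raw approval rate by the baseline; the article does not spell out the exact procedure, and the raw rate below is a made-up input, not a figure from the study.

```python
# From the article: share of already-merged human patches that maintainers re-approved.
GOLDEN_APPROVAL_RATE = 0.68


def normalized_approval(raw_rate, baseline=GOLDEN_APPROVAL_RATE):
    """Rescale a raw approval rate so that the human baseline maps to 1.0.

    Assumption: normalization is a simple ratio against the baseline.
    """
    if not 0 < baseline <= 1:
        raise ValueError("baseline must be in (0, 1]")
    return raw_rate / baseline


hypothetical_raw_rate = 0.34  # illustrative input only, not a study result
print(f"{normalized_approval(hypothetical_raw_rate):.2f}")
```

Under this assumption, a model whose patches are approved as often as the human reference patches scores 1.0, which makes it comparable to a benchmark on which the reference patches pass by construction.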

Reasons for Rejections

The maintainers categorized the rejections into three main reasons: poor code quality, damage to existing code, and basic functional errors. The last of these, in particular, was identified as a major cause of rejection. AI agents, while passing automated tests, fail to address the underlying issues.

The transition from Claude 3.5 Sonnet to Claude 3.7 Sonnet resulted in significantly higher success rates but also led to more cases where maintainers reported basic functional errors. From Claude 3.7 to Claude 4 Opus, the issues evolved from "test failed" to "simply poor code quality." Claude 4.5 Sonnet primarily improved code quality. GPT-5 performed significantly worse than Anthropic's models in this study.

A Revealing Temporal Analysis

The researchers also conducted a temporal analysis that relates success rates to how long each task takes a human to complete. The results show a significant gap: for example, Claude 4.5 Sonnet reaches a 50% success rate on tasks of about 50 minutes according to the automated verifier, but only on tasks of about 8 minutes according to human maintainers.

This analysis underscores that benchmarks, while useful, are not perfect indicators of the actual performance of AI agents. Practical feedback and human evaluation remain crucial for an accurate assessment of model capabilities.

What METR describes here is something the field already knows: benchmarks are, at best, a proxy for the real performance of AI, and practical feedback matters more. Opinions remain divided: many developers say that AI agents genuinely assist in coding, while others raise serious concerns about the quality of AI-generated code.

Implications for Software Development

METR's study does not claim that AI models have reached a capacity ceiling. On the contrary, it suggests that improvements in task formulation and instructions could narrow the gap between automated and human evaluations.

However, the comparison was not entirely fair: human developers can revise their code in response to feedback, which was not the case for AI agents in this study. Additionally, the maintainers did not have access to automated testing tools, and the issues stemmed from older project states.

The excitement around AI coding over the past four months also revolves around new models like GPT-5.2-Thinking and Claude Opus 4.5. These models were not part of the test; the study only examined weaker predecessors.

In conclusion, this study emphasizes the importance of not taking benchmark scores at face value. The results from SWE-bench Verified could overestimate the utility of AI agents without further adjustments and human feedback. METR researchers suspect that similar distortions could exist in other benchmarks, which could have significant implications for companies and investors in the AI field.

Joel Becker, co-author and researcher at METR, also puts his own findings into perspective on X. His main takeaway is not that AI agents are fundamentally useless. With AI capabilities doubling every three to six months, even performance gaps of two to ten times can close rapidly.
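The arithmetic behind that last point can be worked through quickly. This is a rough sketch that assumes steady exponential growth and that the gap is measured in the same quantity that doubles; the function name is illustrative.

```python
import math


def months_to_close(gap_factor, doubling_months):
    """Months needed to grow by gap_factor if capability doubles every doubling_months."""
    return math.log2(gap_factor) * doubling_months


# Gap sizes and doubling times taken from the ranges quoted in the article.
for gap in (2, 10):
    for doubling in (3, 6):
        months = months_to_close(gap, doubling)
        print(f"gap {gap}x, doubling every {doubling} months: {months:.1f} months")
```

Under these assumptions a twofold gap closes in three to six months, and a tenfold gap in roughly 10 to 20 months.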