OpenAI drops the leading benchmark for evaluating code — and that changes the rules of the game
OpenAI is ending evaluation of its models on SWE-bench Verified, a benchmark long seen as the gold standard for measuring AI's ability to write code.
AI-processed from OpenAI Blog; edited by Hamidun News
When a company whose models have consistently ranked at the top of a leaderboard publicly abandons that leaderboard, it is more than a corporate decision. It is a signal of a systemic problem. OpenAI announced that it will stop evaluating its models on SWE-bench Verified, the benchmark that for the past two years has served as the main measure of how well AI can write and fix real code. The reason is both simple and alarming: the benchmark no longer measures what it should.
SWE-bench emerged as an ambitious attempt to move beyond synthetic tests. Instead of asking a model to solve an abstract LeetCode problem, the benchmark presented real bug reports from popular open-source Python projects such as Django, scikit-learn, and sympy. The model had to understand the bug description, find the right file in the repository, and write a patch that would pass the project's tests. The Verified version came later as a cleaned-up subset in which the tasks were manually reviewed by humans. It was on this version that labs competed, proudly publishing their percentage of solved tasks in each press release.
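To make the scoring concrete, here is a minimal Python sketch of how a solved-task percentage of this kind could be computed. It assumes the convention used by SWE-bench-style harnesses, in which a task counts as solved only if the tests that reproduce the bug now pass and the tests that passed before the patch still pass. The TaskResult structure, the function names, and the task IDs are hypothetical and serve only as illustration; this is not the official evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after the model's patch is applied.

    Hypothetical structure for illustration; maps test name -> passed?
    """
    task_id: str
    fail_to_pass: dict[str, bool]  # tests that failed before the fix and should now pass
    pass_to_pass: dict[str, bool]  # tests that passed before and must keep passing


def is_resolved(result: TaskResult) -> bool:
    # Require at least one bug-reproducing test, so an empty test list
    # cannot count as a solved task.
    bug_fixed = bool(result.fail_to_pass) and all(result.fail_to_pass.values())
    nothing_broken = all(result.pass_to_pass.values())
    return bug_fixed and nothing_broken


def resolve_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks that count as solved."""
    if not results:
        return 0.0
    solved = sum(is_resolved(r) for r in results)
    return 100.0 * solved / len(results)


if __name__ == "__main__":
    sample = [
        TaskResult("example-1", {"test_bug": True}, {"test_old": True}),
        # The bug is fixed, but the patch broke an existing test.
        TaskResult("example-2", {"test_bug": True}, {"test_old": False}),
    ]
    print(f"{resolve_rate(sample):.1f}% solved")
```

Note that the whole score rests on the test lists: if those tests are incomplete or wrong, the percentage is wrong with them.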
But behind the facade of impressive numbers, problems were accumulating. Internal OpenAI analysis revealed two critical vulnerabilities.
The first is training data contamination. SWE-bench Verified tasks are based on public pull requests in open repositories, and this data inevitably ends up in the training corpora of large language models. In other words, models may have seen the correct answers before they were ever tested. This is a classic data leakage problem, but in the case of SWE-bench it reached a scale that makes the results statistically meaningless.
The second is the quality of the tests themselves. Some tasks contained incorrect or incomplete tests that could accept incorrect solutions or reject correct ones.
And once a benchmark becomes popular enough, people begin to optimize for it, not always by honest means.
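One common way to probe for the kind of contamination described above is to measure how much of a benchmark's reference solution appears verbatim in a text corpus. The Python sketch below uses a simple word n-gram overlap heuristic. It is an illustration under stated assumptions, not the method OpenAI used in its analysis; the function names, the whitespace tokenization, and the default n-gram size are all hypothetical choices.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive whitespace-separated tokens in the text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(reference: str, corpus_docs: list[str], n: int = 5) -> float:
    """Fraction of the reference's n-grams found verbatim in any corpus document.

    A value near 1.0 suggests the reference solution may have leaked into
    the corpus; a value near 0.0 suggests it did not, at least verbatim.
    """
    target = ngrams(reference, n)
    if not target:
        return 0.0  # reference is shorter than n tokens, nothing to compare
    seen: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        seen |= ngrams(doc, n) & target
    return len(seen) / len(target)


if __name__ == "__main__":
    patch = "if value is None : return default"
    corpus = ["x = 1 if value is None else 2", "unrelated text here"]
    print(overlap_ratio(patch, corpus, n=3))
```

A heuristic like this only catches verbatim or near-verbatim leakage. Paraphrased solutions or leaked issue discussions would slip through, which is part of why contamination is hard to rule out for any benchmark built from public data.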
It's important to understand the context in which this decision was made. The AI coding industry is experiencing explosive growth. Dozens of startups, from Cognition with its Devin to Poolside and Magic, are attracting hundreds of millions of dollars in investment, and almost all of them use SWE-bench results as a key argument in their pitch decks. Major labs, including Anthropic, Google DeepMind, and OpenAI itself, publish results on this benchmark with every new model release. In effect, SWE-bench Verified became the currency of trust in the AI programming segment. And now OpenAI is saying that this currency has been devalued.
The company recommends transitioning to SWE-bench Pro, an updated version of the benchmark that is designed to address both problems. New tasks are specifically selected to minimize overlap with public training data, and tests undergo more rigorous verification. However, a natural question arises: how long will SWE-bench Pro remain clean? The history of benchmarks in machine learning is a story of gradual degradation. ImageNet, GLUE, SuperGLUE, MMLU: each of them eventually stopped distinguishing truly strong models from those that were simply well tuned to a specific test.
For the industry, the consequences of this decision extend far beyond a single benchmark. Investors putting money into AI coding startups must now ask themselves what actually stands behind the impressive numbers they were shown. Companies integrating AI assistants into their development processes will have to reconsider their selection criteria. And researchers receive yet another reminder that in the race to lead on benchmarks, the connection to real usefulness is easily lost.
There's also a deeper question. If the world's leading AI lab admits that the standard tool for measuring progress is broken, how do we understand whether models are actually getting better? In a world where every quarter brings a new "revolutionary" model with record-breaking numbers, the absence of a reliable yardstick is not a technical trifle, but a fundamental problem.
OpenAI deserves respect for the honesty of this admission. But the very fact that the industry relied on a contaminated benchmark for so long speaks to a systemic deficit of critical thinking in the community. The transition to SWE-bench Pro is a step in the right direction.
But real progress will begin when we stop reducing AI evaluation to a single number on a single test and start building multidimensional, manipulation-resistant evaluation systems that reflect the real ability of models to help developers in their everyday work.