Cursor questioned public AI benchmarks for code with five charts
Cursor published five charts on how it evaluates models for programming and effectively challenged almost all public AI benchmarks.
AI-processed from Habr AI; edited by Hamidun News
On March 11, 2026, Cursor published an explanation of how it compares models inside its product and, in the process, took aim at the entire industry of AI coding benchmarks. Instead of another leaderboard table, the company showed why the familiar percentage of solved tasks increasingly fails to describe real value for developers.
Why the Charts Matter
Cursor's first conclusion is very practical: a programming model cannot be evaluated solely by the share of tasks it solves. The company showed a chart with two metrics side by side: correctness of the answer and median tokens to completion. For the user, this is not an abstraction. Tokens turn into latency, cost, and the overall feel of the work. If one model solves slightly more tasks but spends several times more tokens, it can still lose as a product. Public benchmarks usually hide this tradeoff and leave a single attractive percentage in the table.
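To make the tradeoff concrete, here is a minimal Python sketch of a two-metric comparison. It is not Cursor's evaluation code: the run records, the model names, and the helpers `summarize` and `pareto_front` are all hypothetical and exist only for illustration. The sketch computes a solve rate and median tokens per model, then keeps the models that no other model beats on both axes at once.

```python
from statistics import median

# Hypothetical run records: (model name, task solved?, tokens used to finish).
# The numbers are invented for illustration and are not taken from Cursor's charts.
RUNS = [
    ("model_a", True, 12_000), ("model_a", True, 15_000),
    ("model_a", False, 40_000), ("model_a", True, 13_000),
    ("model_b", True, 48_000), ("model_b", True, 52_000),
    ("model_b", True, 61_000), ("model_b", True, 90_000),
    ("model_c", True, 30_000), ("model_c", False, 35_000),
    ("model_c", False, 50_000), ("model_c", True, 28_000),
]


def summarize(runs):
    """Return {model: (solve_rate, median_tokens)} for the given run records."""
    by_model = {}
    for model, solved, tokens in runs:
        by_model.setdefault(model, []).append((solved, tokens))
    return {
        model: (
            sum(solved for solved, _ in rows) / len(rows),
            median(tokens for _, tokens in rows),
        )
        for model, rows in by_model.items()
    }


def pareto_front(summary):
    """Return models that are not dominated on (higher solve rate, fewer tokens)."""
    front = []
    for name, (acc, tok) in summary.items():
        dominated = any(
            o_acc >= acc and o_tok <= tok and (o_acc > acc or o_tok < tok)
            for other, (o_acc, o_tok) in summary.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


if __name__ == "__main__":
    summary = summarize(RUNS)
    for model, (acc, tok) in sorted(summary.items()):
        print(f"{model}: solve rate {acc:.2f}, median tokens {tok:.0f}")
    print("not dominated:", pareto_front(summary))
```

In this invented data, a ranking by solve rate alone would put `model_b` first and hide that it spends roughly four times as many tokens as `model_a`. Keeping both columns leaves that choice visible to the reader.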
The second blow was aimed at the very idea of a "stable" test. CursorBench is compiled from real sessions using Cursor Blame, a system that links committed code to the agent requests that produced it. According to Cursor, between the first version and CursorBench-3 the scope of tasks roughly doubled, both in code volume and in the average number of files. This means developers are already asking AI not just to fix small bugs, but to carry out longer tasks spread across the project. Against this background, frozen sets like SWE-bench age faster and faster, even if their results remain formally reproducible.
Five Weak Points
Taken together, the five charts read less like an advertisement for an internal benchmark and more like a critique of the entire current system for evaluating coding models. Cursor is effectively saying that the industry has gotten used to measuring what is convenient to count, not what developers actually experience in the editor, the terminal, and a long work session.
- A single-metric ranking hides tradeoffs between answer quality, speed, and cost.
- A frozen set of tasks becomes outdated while real agent requests grow longer and more complex.
- Long issues with short patches test instruction-following, not understanding of vague intent.
- When top models converge on near-identical scores, the results no longer help choose a tool for production.
- Offline scores mean little if they don't correlate with how the model behaves in a real product.
How CursorBench Works
Cursor's approach differs not only in the set of tasks, but in what counts as a good test. In public benchmarks, the model often gets a long description of a bug and has to produce a short, precise fix. In CursorBench, the picture is reversed: descriptions are shorter, but solutions are longer. This is closer to real work, where a person writes something like "fix login" or "refactor pipeline" to an agent, and the model itself must understand the repository context, choose a strategy, and make significant changes across multiple files. So the benchmark tests not only accuracy, but also the ability to work out and carry through an underspecified intent.
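The contrast between short prompts and long solutions can be measured per task. The sketch below is an assumption about how one might do it, not a description of how CursorBench is built: `patch_stats` and `task_shape` are hypothetical helpers, and the prompt and diff are invented. It counts the words in the request, the files touched, and the changed lines in a unified diff.

```python
def patch_stats(diff_text):
    """Count files touched and changed lines in a unified diff.

    Simplified on purpose: a removed line whose content starts with "-- "
    would be mistaken for a file header, which is acceptable for a sketch.
    """
    files, changed = 0, 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            files += 1  # one "+++" header per file in the patch
        elif line.startswith("--- "):
            continue  # old-file header, not a changed line
        elif line.startswith(("+", "-")):
            changed += 1  # added or removed line
    return files, changed


def task_shape(prompt, diff_text):
    """Describe a task by prompt length versus solution size."""
    words = len(prompt.split())
    files, changed = patch_stats(diff_text)
    return {
        "prompt_words": words,
        "files": files,
        "changed_lines": changed,
        "changed_lines_per_prompt_word": changed / max(words, 1),
    }


# Hypothetical task: a terse request and the patch that resolved it.
PROMPT = "fix login"
DIFF = """\
--- a/auth/session.py
+++ b/auth/session.py
@@ -10,2 +10,3 @@
-    token = None
+    token = load_token(user)
+    refresh(token)
--- a/auth/views.py
+++ b/auth/views.py
@@ -4,1 +4,1 @@
-    return redirect("/")
+    return redirect(next_url)
"""

if __name__ == "__main__":
    print(task_shape(PROMPT, DIFF))
```

A benchmark dominated by long issue texts and one-file patches would show a low ratio of changed lines to prompt words. Tasks of the kind the article describes would show the opposite.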
This leads to another important effect: CursorBench separates model results at the frontier more clearly. Where public tests begin to show nearly identical scores, and even place weaker models alongside stronger ones, Cursor's internal set preserves differences that match user experience. The company supplements offline evaluation with controlled online experiments on live traffic and looks not at a single number but at a set of signals: result quality, agent behavior, and usefulness to the developer. If an offline grader considers an answer correct but users find the result harder to work with, that degradation still surfaces.
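One simple way to check whether an offline score tracks an online signal is a rank correlation across models. The article does not say which statistic Cursor uses, so the Spearman correlation below is an assumption chosen for illustration, and every number in the example is invented.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input has no variation at all.
    """
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model numbers, one entry per model, same order in each list.
OFFLINE_A = [0.62, 0.71, 0.70, 0.55]  # an offline benchmark score
OFFLINE_B = [0.80, 0.81, 0.82, 0.83]  # another offline benchmark score
ONLINE = [0.40, 0.52, 0.47, 0.35]     # an online usefulness signal

if __name__ == "__main__":
    print("offline A vs online:", spearman(OFFLINE_A, ONLINE))
    print("offline B vs online:", spearman(OFFLINE_B, ONLINE))
```

An offline score that orders models the same way live usage does yields a correlation near 1. A score that compresses the top models into a narrow band, like the invented second list, can order them differently from what users experience.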
What It Means
The story matters not just for Cursor users. It shows that the market for coding agents has entered a stage where synthetic leaderboard tables are no longer a reliable guide, especially when choosing between the best models. The next wave of competition will not be for the loudest benchmark score, but for balance between quality, speed, cost, and how confidently the agent handles real, imperfectly formulated engineering tasks.