OpenAI GPT-5.4 solved a FrontierMath problem a mathematician spent 20 years devising
AI-processed from Habr AI; edited by Hamidun News
OpenAI GPT-5.4 solved a problem from the FrontierMath benchmark that Polish mathematician Bartosz Naskręcki had been building for nearly twenty years and considered practically out of reach for machines. For the author himself, this was a personal turning point: not long ago he called AI a "very advanced calculator," and now he speaks of a new level of collaborative work with the model.
Why this surprised everyone
FrontierMath is one of the hardest mathematical benchmarks for AI. It contains 350 original problems in number theory, algebraic geometry, topology, combinatorics, and analysis. The hardest tier, Tier 4, consists of 48 research-level problems: even a strong mathematician with a PhD might need at least a month just to work out how to approach such a problem. This is exactly the kind of case Naskręcki designed his problem for: not a textbook exercise, but something close to the limit of difficulty.
Naskręcki was one of the few European mathematicians invited to write problems for this set. His problem grew out of roughly fifteen years of narrowly focused research, and the formalized solution ran to 13 dense pages. The answer was a very large number, chosen to rule out random guessing. What was surprising, then, was not only that GPT-5.4 got the answer right, but also how it got there: instead of brute-force enumeration, the model noticed the underlying structure and found a shorter path. According to the author, the model's approach turned out to be "clean and elegant."
"My singularity just happened… and on the other side there is life, receding into infinity!"
How quickly the result grew
The story matters not only because of one beautiful problem, but because of the speed of progress. When FrontierMath was launched in late 2024, the best models solved less than 2% of the problems. Over sixteen months, results grew by an order of magnitude, and not only on the open examples but also on the hidden set, to which OpenAI had no direct access. This matters because "overfitting to the answers" remains the skeptics' main objection whenever a new model shows a sharp jump in mathematics.
- End of 2024: best models solve less than 2% of FrontierMath problems.
- Mid-2025: GPT-5 Pro reaches 13% on Tier 4.
- January 2026: GPT-5.2 Pro rises to 31% on Tier 4.
- March 2026: GPT-5.4 Pro reaches 50% across Tiers 1–3 and 38% on Tier 4.
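The "order of magnitude" claim can be checked against the figures in the timeline. The sketch below is only a rough illustration, with two assumptions: the late-2024 value is treated as exactly 2%, although the text gives it only as an upper bound ("less than 2%"), and that single baseline is compared against both the Tier 4 and the Tiers 1–3 results, although the text does not break the 2024 figure down by tier. The names `BASELINE_UPPER_BOUND` and `min_growth_factor` are made up for this example.

```python
# Solve rates quoted in the timeline, in percent.
# Assumption: the late-2024 baseline is taken as 2.0, the upper bound given
# in the text, so each ratio below is a lower bound on the real growth.
BASELINE_UPPER_BOUND = 2.0

milestones = [
    ("Mid-2025, GPT-5 Pro, Tier 4", 13.0),
    ("January 2026, GPT-5.2 Pro, Tier 4", 31.0),
    ("March 2026, GPT-5.4 Pro, Tier 4", 38.0),
    ("March 2026, GPT-5.4 Pro, Tiers 1-3", 50.0),
]


def min_growth_factor(score, baseline=BASELINE_UPPER_BOUND):
    """Return how many times larger `score` is than the assumed baseline."""
    return score / baseline


growth = {label: min_growth_factor(score) for label, score in milestones}

for label, factor in growth.items():
    print(f"{label}: at least {factor:.1f}x the late-2024 level")
```

Under these assumptions, every 2026 result is more than ten times the late-2024 level, while the mid-2025 Tier 4 result is still below that mark.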
The result on the hidden problems deserves separate attention. According to the article, GPT-5.4 solved 55% of them, versus 25% of the problems whose data and solutions OpenAI could in theory have had access to. This does not prove that the experiment was perfectly "clean," but it significantly strengthens the case that the model can actually reason about new problems rather than simply reproduce patterns it has seen. For research benchmarks, this is perhaps the most sensitive test: novelty matters more than any demonstration on already known examples.
Why skepticism didn't disappear
For all the strength of the case, the story does not reduce to the formula "the machine already thinks like a human." In the same evaluation run, GPT-5.4 solved another Tier 4 problem, but preliminary analysis showed that the model could have relied on an old 2011 preprint that the problem's author did not even know about. This is a good example of how the boundary blurs between independent reasoning and very effective literature search, especially if the model can use the web and quickly collect rare sources.
There is also a second layer of questions: the independence of the benchmark itself. FrontierMath is funded by OpenAI, and the company has access to a significant portion of the problems and solutions. The hidden set, on which GPT-5.4 also showed strong results, partly eases the tension but does not fully remove the conflict of interest.
Therefore, it is reasonable to read this story in two modes simultaneously: as a real signal of a sharp increase in the mathematical capabilities of models, and as a reminder that the industry still needs independent tests, transparent methodologies, and external verification of striking results.
What it means
The main conclusion is not that mathematicians should be replaced. Rather the opposite: Naskręcki's story shows that leading models are beginning to work as research partners that narrow the search space and suggest unexpected moves. For science and applied R&D, this is a serious shift: AI looks less and less like a calculator and more like a co-author whose ideas can no longer be ignored but still need to be carefully checked.