Habr AI Compared Claude, Gemini, and ChatGPT on Text, Math, Analysis, and Creativity

Q: What is the source?

Originally published on Habr AI. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Apr 30, 2026. Reading time: 3 min.

Hamidun News Editorial

AI monitoring · Habr AI

AI-processed from Habr AI; edited by Hamidun News

Habr AI published a comparison of three flagship models: ChatGPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro. Instead of the usual tests such as "write a story" or "solve a problem," the author focused on non-standard scenarios where the real differences between the systems are easier to see.

Non-standard test format

The main idea of the piece is not to find an absolute winner, but to check how the models behave outside the most worn-out demos. When LLMs are asked to write a short text, generate a code template, or solve a school-level problem, they often perform similarly. In unusual edge cases, or simply in less formulaic tasks, differences start to show in their reasoning style, flexibility, resilience to ambiguity, and ability to hold context without a hint at every step.

This approach matters because users increasingly treat models not as a way to run a single isolated command, but as a working tool. In practice, a model needs not only to answer correctly, but also to pick up on implicit requirements, cope with imprecise wording, avoid padding, and keep its logic intact partway through an argument. That is why a comparison built on unusual assignments looks more useful than yet another formal benchmark.

Three flagship models

The test features ChatGPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro, three systems that regularly come up at the top of discussions about generation quality. The lineup itself shows that this is not a niche experiment but a comparison of current flagships: the models that advanced users, editors, analysts, and teams with LLMs in their daily workflows most often choose between.

It is also important that the author does not present the material as a final market verdict. Rather, it is an attempt to answer a more practical question: where exactly do the differences between the best models become noticeable? In routine tasks the gap may be small, but in scenarios with ambiguity, combined requirements, and creative constraints, each model shows its own character. For the reader, this is more useful than a dry ranking because it helps match a model to a specific type of work.

What's being evaluated

According to the article's description, the focus is on four groups of tasks that are closer to real-world use than to a staged demo. The point is not to check a single metric, but to see how a model switches between different kinds of thinking: from careful editing and formal logic to calculation and free idea generation. Such a set makes it possible to evaluate not one narrow ability, but the behavior of the system across modes, from accuracy to creativity.

Text work and quality of formulation
Math and resilience to calculation errors
Analytical tasks with multiple conditions
Creativity in atypical and not fully formalized requests

The strength of such a comparison is that it shows not only how much a model knows, but also the character of its responses. One system may be more careful with structure, another bolder with ideas, a third more consistent in its logic. For a user, this often matters more than an abstract first place, because the choice of LLM depends not on general hype but on what exactly needs to be done: editing text, verifying reasoning, solving problems, or quickly finding unconventional solutions.

What this means

Comparisons like this are gradually changing the way we talk about LLMs. The question is no longer "who is smarter overall," but "which model handles your actual scenario better." For the market, this is a sign of maturity: flagships have become strong enough to be judged not by wow-factor, but by the nuances of their performance.

Hamidun News

AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
