
Google argues for thinking deeper, not longer — and halves inference costs

A joint study by Google and the University of Virginia challenges a central dogma of recent years: the longer the Chain-of-Thought, the better the model's answe

AI-processed from MarkTechPost; edited by Hamidun News
Source: MarkTechPost. Collage: Hamidun News.

For the past few years, the large language model industry has lived by an unspoken rule: if you want a more accurate answer to a complex question, make the model think longer. The Chain-of-Thought technique, in which a model builds a chain of reasoning step by step before giving its final answer, became the gold standard. Developers lengthened these chains, expanded context windows, and spent more compute on each request. But new research conducted jointly by the University of Virginia and Google argues that we have been confusing verbosity with intelligence all along.

The idea underlying the work is deceptively simple. The researchers asked whether each additional token in a reasoning chain really brings the model closer to the correct answer, or whether a significant portion of those tokens is noise, repetition, and treading water. To answer this question, the team introduced a new metric, the Deep-Thinking Ratio. Instead of measuring reasoning length in tokens, this metric evaluates what fraction of the reasoning actually contains productive logical steps: those that lead to solving the problem rather than just filling space.
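To make the idea of a "fraction of productive reasoning" concrete, here is a minimal Python sketch. It is not the researchers' implementation: the `ReasoningStep` type, the `productive` label, and the token-weighted formula are all assumptions made for illustration, since the article does not say how steps are judged or weighted.

```python
from dataclasses import dataclass


@dataclass
class ReasoningStep:
    text: str
    tokens: int
    # Hypothetical label: the article does not describe how a step
    # is classified as productive, so this is supplied by hand.
    productive: bool


def deep_thinking_ratio(steps):
    """Token-weighted share of a reasoning chain spent on productive steps.

    Illustrative formula only (an assumption, not the published metric).
    """
    total = sum(s.tokens for s in steps)
    if total == 0:
        return 0.0
    productive = sum(s.tokens for s in steps if s.productive)
    return productive / total


chain = [
    ReasoningStep("Set up the equation", 40, True),
    ReasoningStep("Restate the problem again", 60, False),
    ReasoningStep("Solve for x", 50, True),
    ReasoningStep("Re-check an earlier step with no new result", 50, False),
]

ratio = deep_thinking_ratio(chain)
print(f"{ratio:.2f}")  # 0.45: 90 of 200 tokens do useful work
```

In this toy chain, more than half of the tokens are spent on restating and re-checking, which is the kind of padding the article describes.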

The results were striking. The analysis showed that in typical long reasoning chains of modern LLMs, a large portion of intermediate steps carries no real semantic load. A model can rephrase the same thought dozens of times, return to stages it has already covered, and generate redundant explanations, and all of this costs real money. Each extra token at inference time means GPU time, electricity, and latency for the end user. At the scale of large services processing billions of requests, that adds up to colossal sums.

The key achievement of the research is that, by optimizing the reasoning process with the Deep-Thinking Ratio in mind, the researchers achieved two things simultaneously that are usually considered mutually exclusive. The accuracy of the model's answers improved, because cutting out unproductive steps reduces the probability that the model will "get lost" in its own reasoning and reach an erroneous conclusion. And overall inference costs fell by approximately half, because the model generates significantly fewer tokens per request. This is not a compromise between quality and cost, but a rare case where a single optimization improves both.
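The link between fewer generated tokens and lower cost can be shown with simple arithmetic. The sketch below assumes a flat per-token price. The request volume, token counts, and price are invented numbers for illustration, not figures from the study.

```python
def inference_cost(requests, tokens_per_request, price_per_million_tokens):
    """Total generation cost under a flat per-token price (an assumption)."""
    total_tokens = requests * tokens_per_request
    return total_tokens * price_per_million_tokens / 1_000_000


# Hypothetical workload: one million requests at 10.0 currency units
# per million output tokens. None of these values come from the research.
baseline = inference_cost(1_000_000, 2_000, 10.0)  # long reasoning chains
trimmed = inference_cost(1_000_000, 1_000, 10.0)   # chains half as long

savings = 1 - trimmed / baseline
print(baseline, trimmed, savings)  # 20000.0 10000.0 0.5
```

Under a flat per-token price, cost scales linearly with output length, so halving the tokens generated per request halves the bill. Real pricing and hardware utilization are less tidy, so treat this as a rough estimate.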

To understand the scale of this discovery, it's worth recalling the context. Inference cost is one of the industry's main headaches. OpenAI, Google, Anthropic, and other companies spend billions of dollars on computational infrastructure, and a significant portion of these expenses goes to generating answers for users. Models like OpenAI's o1 and o3, as well as Google's Gemini with extended thinking, were specifically designed for long reasoning chains. If it turns out that half of these reasoning chains can be painlessly cut out, or more precisely, that the model can be taught not to generate them in the first place, the economic effect could be measured in hundreds of millions of dollars annually.

There is also a deeper theoretical aspect. The research effectively questions the inference scaling paradigm that dominated 2024-2025. If "thinking longer" does not equal "thinking better," then the race to expand context windows and increase computational budgets for reasoning is a dead end. Instead, the industry might want to focus on the quality of each reasoning step rather than the number of steps. This echoes how human thinking works: an expert solves a problem not because they think longer than a beginner, but because each of their thinking steps is more purposeful.

The practical consequences for developers and users may appear fairly quickly. The Deep-Thinking Ratio is a metric that is relatively straightforward to integrate into existing model training and evaluation pipelines. We can expect major labs to start using similar approaches during fine-tuning, and cloud providers to use them to optimize API call costs. For end users, this means faster and more accurate answers at the same or a lower price.

The research from Google and the University of Virginia reminds the industry of an important truth that's easy to forget in the race for scale: efficiency is not about "more," but about "more accurate." The models of the future will probably not be those that think longest, but those that know how to think substantively.
