The End of Expensive AI: Google and NVIDIA Slash Inference Costs
Model training costs are dwarfed by the far larger expense of daily inference. At Google Cloud Next, Google and NVIDIA presented a joint…
AI-processed from AI News; edited by Hamidun News
The artificial intelligence industry has long been held hostage by its own ambitions, masking fundamental economic problems behind flashy announcements. Public attention has traditionally focused on the colossal costs of training new language models, yet the real financial black hole lurks in their everyday operation. The process of generating responses to millions of daily user queries, known in the industry as inference, requires continuous operation of enormous and incredibly energy-intensive computational clusters.
This factor alone has made widespread deployment of truly advanced, multimodal AI economically unfeasible for the vast majority of companies. At Google Cloud Next, tech giants Google and NVIDIA announced the end of this era of infrastructure scarcity, presenting a new joint architecture that promises to slash inference costs by a factor of ten.
The foundation of the announcement is the new A5X compute instances, offered on bare-metal infrastructure. Dropping the hypervisor layer removes virtualization overhead, so workloads get direct access to the hardware's full computational power. These instances are built on NVIDIA's Vera Rubin architecture, the generational successor to Blackwell.
The key element of the new infrastructure is the NVL72 rack system. In the traditional modular approach, individual graphics processors are installed in standard servers, and data transfer between those servers inevitably becomes a bottleneck. The NVL72, by contrast, is a single compute system the size of an entire cabinet. Within the rack, seventy-two next-generation graphics processors function as one giant supercomputer, unified by ultra-fast NVLink interconnect.
This approach to server hardware architecture targets the primary problem of modern inference: memory bandwidth. Even the largest language models, with hundreds of billions of parameters, can now be loaded entirely into the system's shared memory. This frees the cluster from the constant, slow, and energy-intensive shuffling of data blocks between individual nodes. The stated tenfold reduction in token generation costs comes not only from the raw silicon power of the Rubin chips, but also from deep hardware-software co-design. Notably, Google, which has its own powerful tensor processors (TPUs), still pursued this level of integration with NVIDIA, an acknowledgment that a hybrid approach is needed to meet the enormous demand from developers.
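To make the memory argument above concrete, here is a back-of-the-envelope sketch in Python. It estimates the footprint of a model's weights and checks whether they fit in a rack's pooled GPU memory. The per-GPU memory figure (`gib_per_gpu`) and the overhead factor are hypothetical placeholders chosen for illustration, not published specifications for the hardware described here.

```python
def weights_gib(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 2**30


def fits_in_rack(
    params_billion: float,
    bytes_per_param: float,
    gpus: int = 72,              # GPUs per rack, as stated in the article
    gib_per_gpu: float = 192.0,  # ASSUMPTION: hypothetical per-GPU memory
    overhead: float = 1.3,       # ASSUMPTION: rough allowance for KV cache etc.
) -> bool:
    """Return True if weights plus overhead fit in the rack's pooled memory."""
    needed = weights_gib(params_billion, bytes_per_param) * overhead
    return needed <= gpus * gib_per_gpu


if __name__ == "__main__":
    # A 500-billion-parameter model stored at 2 bytes per parameter.
    print(f"weights: {weights_gib(500, 2):.0f} GiB")
    print("fits on one GPU:", fits_in_rack(500, 2, gpus=1))
    print("fits in one rack:", fits_in_rack(500, 2))
```

Under these assumed numbers, the weights are far too large for any single GPU but occupy only a fraction of the rack's combined memory, which is the situation the article describes.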
Engineers from both companies rewrote the basic compute management stack, optimizing it for the specific needs of large-scale content generation. New software-level load distribution algorithms now account for the physical topology of the Vera Rubin rack, minimizing signal latency at the microsecond level. At the same time, advanced liquid cooling and new intelligent power controllers allowed a sharp reduction in electricity consumption per unit of generated output. For modern data centers, where electricity bills can rival the cost of the servers themselves, this is a critical factor in profitability.
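The article gives no detail on how the topology-aware load distribution works. Purely as an illustration of the general idea, the toy Python sketch below places a job on GPUs that sit in the same rack, filling neighboring trays first so that communicating processors stay physically close. The function name `place_job` and the `(rack, tray, gpu_id)` tuple layout are hypothetical, and this is not Google's or NVIDIA's scheduler.

```python
from collections import defaultdict


def place_job(free_gpus, k):
    """Pick k free GPUs from a single rack, or return None if no rack fits.

    free_gpus is a list of (rack, tray, gpu_id) tuples (hypothetical layout).
    """
    by_rack = defaultdict(list)
    for rack, tray, gpu_id in free_gpus:
        by_rack[rack].append((tray, gpu_id))

    # Keep only racks that can host the whole job without crossing racks.
    candidates = [(len(gpus), rack) for rack, gpus in by_rack.items() if len(gpus) >= k]
    if not candidates:
        return None

    # Best fit: choose the rack with the fewest free GPUs that still fits,
    # leaving emptier racks available for larger jobs.
    _, rack = min(candidates)

    # Fill trays in order so the chosen GPUs are neighbors within the rack.
    chosen = sorted(by_rack[rack])[:k]
    return [(rack, tray, gpu_id) for tray, gpu_id in chosen]


if __name__ == "__main__":
    free = [("r1", 0, 0), ("r1", 0, 1), ("r2", 0, 0), ("r2", 0, 1), ("r2", 1, 0)]
    print(place_job(free, 2))
    print(place_job(free, 3))
    print(place_job(free, 4))
```

A real scheduler would weigh many more factors, such as link bandwidth, current load, and power limits. The sketch only shows why knowing the physical layout changes placement decisions.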
The consequences of this announcement for the technology market are hard to overstate, because it attacks the fundamental unit-economics barrier for AI-based services. Until now, independent developers and large corporations alike have been forced to compromise. They had to limit product functionality by using cheaper but less capable models, or impose strict request limits to keep cloud bills from sinking the business. A tenfold reduction in costs means that business models that looked like pure fantasy yesterday because of their computational expense could now turn a profit.
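The unit-economics point can be shown with simple arithmetic. The Python sketch below computes a per-user monthly margin before and after a tenfold drop in inference price. Every figure in it (subscription price, request volume, tokens per request, price per million tokens) is a hypothetical value chosen for illustration, not data from the announcement.

```python
def monthly_inference_cost(
    requests_per_user: int,
    tokens_per_request: int,
    usd_per_million_tokens: float,
) -> float:
    """Inference spend per user per month, in USD."""
    total_tokens = requests_per_user * tokens_per_request
    return total_tokens * usd_per_million_tokens / 1_000_000


def monthly_margin(
    revenue_per_user: float,
    requests_per_user: int,
    tokens_per_request: int,
    usd_per_million_tokens: float,
) -> float:
    """Revenue minus inference spend per user per month, in USD."""
    cost = monthly_inference_cost(
        requests_per_user, tokens_per_request, usd_per_million_tokens
    )
    return revenue_per_user - cost


if __name__ == "__main__":
    # ASSUMPTIONS: a $10/month plan, 2000 requests of 1500 tokens each.
    before = monthly_margin(10.0, 2000, 1500, usd_per_million_tokens=10.0)
    after = monthly_margin(10.0, 2000, 1500, usd_per_million_tokens=1.0)
    print(f"margin at $10 per million tokens: {before:+.2f} USD")
    print(f"margin at $1 per million tokens:  {after:+.2f} USD")
```

With these assumed inputs, the same product moves from losing money on every active user to earning a positive margin, without changing its price or usage limits.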
In the near future, cheaper inference could lead to a quiet but far-reaching revolution in user experience. Complex real-time video analysis, on-the-fly generation of personalized 3D worlds in video games, and AI agents that work in the background around the clock, analyzing all incoming information, could all become a mass-market standard rather than an expensive premium service. For the cloud provider market, the Google-NVIDIA alliance sets a dauntingly high bar for efficiency.
Traditional approaches to data center construction are rapidly becoming obsolete, giving way to solutions optimized at the level of entire racks. This partnership marks an important paradigm shift: the industry is finally moving from a race to build the smartest artificial intelligence to a pragmatic race to deliver it to every user as cheaply, quickly, and efficiently as possible.