
NVIDIA Vera Rubin: How Developers Will Scale Agentic AI Without Latency

NVIDIA released Vera Rubin—a platform for high-speed agentic AI. It combines the Vera Rubin GPU and Groq 3 LPX accelerator. On trillion-parameter models, it ach

AI-processed from NVIDIA Developer Blog; edited by Hamidun News
Source: NVIDIA Developer Blog. Collage: Hamidun News.

NVIDIA introduced the Vera Rubin platform, which targets the main challenge in scaling agentic AI: unpredictable latency in multi-turn sessions.

Why Agentic AI Is Harder to Scale

Typical model scaling works for batch processing: feed many texts, get many responses. But agentic AI works differently. An agent makes a decision, takes an action, observes the result, and makes the next decision. This translates to hundreds of model requests in a single session, each with a small batch size and extremely strict speed requirements.
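To make the accumulation concrete, here is a minimal sketch of a strictly sequential agent loop. The function name, the step count, and the per-call timings are hypothetical values chosen for illustration; none of them come from NVIDIA's article.

```python
def run_agent(max_steps: int, model_ms: float, tool_ms: float) -> float:
    """Simulate a sequential agent loop and return total wall time in seconds.

    Each step must wait for the previous one, so per-call latency adds up
    instead of being hidden by batching. All timings are hypothetical.
    """
    if max_steps < 0:
        raise ValueError("max_steps must be non-negative")
    elapsed_ms = 0.0
    observation = "start"
    for step in range(max_steps):
        elapsed_ms += model_ms            # model call: decide on the next action
        action = f"action-{step}"         # placeholder for the chosen action
        elapsed_ms += tool_ms             # run the action and observe the result
        observation = f"result of {action}"
    return elapsed_ms / 1000.0


# 300 steps with a 200 ms model call and a 50 ms tool call per step
print(run_agent(300, 200.0, 50.0))  # 75.0
# The same session with a 20 ms model call
print(run_agent(300, 20.0, 50.0))   # 21.0
```

Because each call waits for the previous one, only a lower per-call latency shortens the session. A larger batch size does not help.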

Because the agent's trajectory cannot be known in advance (which action it will choose next is unknown), compiling and optimizing execution ahead of time is difficult. Latencies accumulate from step to step, and a long context (400K tokens in the configuration NVIDIA reports) becomes a bottleneck.

Vera Rubin's Three-Layer Solution

Instead of a one-size-fits-all approach, NVIDIA embedded three technologies working together in the platform:

  • Direct cable connections between chips: each LPU is connected to 96 others at 112 Gbps, providing 640 TB/s of bandwidth per rack without switches
  • Compile-time planning of all data transfers: instead of deciding at runtime when and where to send data, the compiler schedules every transfer across the network in advance
  • Clock synchronization across thousands of independent chips: the system aligns the clocks of the LPU accelerators so the network operates with known, predictable latency
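The idea of planning transfers at compile time can be illustrated with a toy scheduler. This is a sketch under stated assumptions, not NVIDIA's or Groq's compiler: the function names are hypothetical, 112 Gbps is read as the rate of each individual link, and each link is assumed to carry one transfer at a time.

```python
LINKS_PER_LPU = 96   # from the article: each LPU connects to 96 others
LINK_GBPS = 112      # from the article; assumed here to be the rate per link


def per_lpu_bandwidth_gbytes() -> float:
    """Aggregate link bandwidth of one LPU in gigabytes per second."""
    return LINKS_PER_LPU * LINK_GBPS / 8


def plan_transfers(transfers, link_gbps=LINK_GBPS):
    """Assign fixed start and end times (in ns) to every transfer up front.

    transfers: list of (src, dst, n_bytes) in program order.
    Transfers on the same (src, dst) link are serialized; different links
    run in parallel. Nothing is decided at runtime.
    """
    bytes_per_ns = link_gbps / 8   # 1 Gbps is 1 bit per nanosecond
    link_free_at = {}              # (src, dst) -> time the link becomes idle
    plan = []
    for src, dst, n_bytes in transfers:
        start = link_free_at.get((src, dst), 0.0)
        end = start + n_bytes / bytes_per_ns
        link_free_at[(src, dst)] = end
        plan.append((src, dst, start, end))
    return plan


print(per_lpu_bandwidth_gbytes())  # 1344.0
for row in plan_transfers([(0, 1, 1400), (0, 1, 2800), (0, 2, 1400)]):
    print(row)
# (0, 1, 0.0, 100.0)
# (0, 1, 100.0, 300.0)
# (0, 2, 0.0, 100.0)
```

Every arrival time is known before the program runs, so a receiving chip can expect its data at a fixed clock tick. That only works if the chips agree on the time, which is why the third bullet, clock synchronization, matters.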

Hybrid Acceleration: NVIDIA + Groq

The platform divides the work between GPUs and specialized accelerators. Vera Rubin NVL72 handles the attention layers (which favor high throughput), while Groq 3 LPX takes on the FFN layers (which require low latency during sequential generation). The KV-cache is synchronized between them one token at a time.
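The division of labor can be traced with a toy decode loop. This sketch only records which side runs which layer and counts hand-offs. The device labels, the function name, and the way the per-token KV-cache sync is modeled (a simple token counter on each side) are assumptions for illustration, not a description of the real runtime.

```python
from collections import Counter


def trace_decode(num_tokens: int, num_layers: int):
    """Trace a toy split decode loop: attention on 'gpu', FFN on 'lpx'.

    Returns (trace, handoffs, kv_tokens), where kv_tokens is the number of
    tokens of KV state each side has seen after the assumed per-token sync.
    """
    trace = []
    handoffs = 0
    kv_tokens = Counter()
    for token in range(num_tokens):
        for layer in range(num_layers):
            trace.append((token, layer, "attention", "gpu"))
            handoffs += 1   # activations move gpu -> lpx
            trace.append((token, layer, "ffn", "lpx"))
            handoffs += 1   # activations move lpx -> gpu
        kv_tokens["gpu"] += 1
        kv_tokens["lpx"] = kv_tokens["gpu"]   # assumed sync, once per token
    return trace, handoffs, dict(kv_tokens)


trace, handoffs, kv = trace_decode(num_tokens=3, num_layers=4)
print(len(trace), handoffs, kv)  # 24 24 {'gpu': 3, 'lpx': 3}
```

In this model, every generated token crosses the device boundary twice per layer. This is one reason a predictable, low-latency interconnect matters for the split to pay off.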

It sounds complicated, but the result is a system that does not have to trade speed against quality.

What Was Achieved

  • 400 tokens per second on 1-trillion-parameter MoE models with 400K context
  • 35x more throughput per watt than GB200 NVL72
  • Predictable latency even when running multiple agents simultaneously
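A quick calculation puts the headline figures in per-token terms. It assumes that 400 tokens per second is the decode rate of a single stream and that "35x more throughput per watt" means 35 times more tokens per joule. The article states neither explicitly. The function names and the 1,000-token response length are hypothetical.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Average time per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def response_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_second


def relative_energy_per_token(perf_per_watt_gain: float) -> float:
    """Energy per token relative to the baseline, for a given perf/W gain."""
    return 1.0 / perf_per_watt_gain


print(ms_per_token(400))                        # 2.5
print(response_seconds(1000, 400))              # 2.5
print(round(relative_energy_per_token(35), 4))  # 0.0286
```

Under these assumptions, a 1,000-token response takes about 2.5 seconds, and each token costs roughly 3% of the energy it would on GB200 NVL72.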

What This Means

For AI agent developers, latency and scalability no longer have to be at odds. Vera Rubin allows deploying large, trillion-parameter models and running complex agents on them without sacrificing response speed. In practice, personal assistants, automation systems, and worker agents will be able to respond quickly even with extended context.

