NVIDIA TensorRT now scales generative AI inference across multiple GPUs
AI-processed from NVIDIA Developer Blog; edited by Hamidun News
NVIDIA has updated TensorRT, adding native support for inference across multiple GPUs at once — large generative models can now run in production without manual sharding and without losing the engine's key optimizations.
Why a Single GPU Is No Longer Enough
Modern generative models are growing faster than GPU memory capacity. Diffusion networks for video generation, multimodal LLMs with extended context, and complex media content pipelines have long exceeded 80 GB, the memory ceiling of the flagship H100. Developers of inference systems faced a stark choice: either split the computational graph manually and lose TensorRT optimizations, or switch to third-party frameworks with lower throughput.
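To make the memory pressure concrete, here is a rough back-of-the-envelope sketch in Python. It counts model weights only and ignores activations, caches, and workspace, so real requirements are higher. The parameter counts and the `usable_fraction` headroom factor are illustrative assumptions, not figures from NVIDIA.

```python
import math


def min_gpus_for_weights(num_params: float,
                         bytes_per_param: float,
                         gpu_memory_gb: float = 80.0,
                         usable_fraction: float = 0.9) -> int:
    """Lower bound on the number of GPUs needed just to hold the weights.

    usable_fraction is an assumed headroom factor (part of VRAM is taken
    by the runtime and workspace); it is not a TensorRT setting.
    """
    weight_bytes = num_params * bytes_per_param
    usable_bytes = gpu_memory_gb * 1e9 * usable_fraction
    return max(1, math.ceil(weight_bytes / usable_bytes))


# Hypothetical model sizes, chosen only for illustration.
for name, params, width in [
    ("70B params, FP16", 70e9, 2),
    ("70B params, FP8", 70e9, 1),
    ("175B params, FP16", 175e9, 2),
]:
    print(f"{name}: at least {min_gpus_for_weights(params, width)} x 80 GB GPU(s)")
```

Even this optimistic estimate shows why quantization and multi-GPU distribution go together: halving the bytes per parameter can move a model back onto one card, but larger models need several devices regardless.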
TensorRT is the de facto standard for production deployment on NVIDIA hardware. The engine optimizes computational graphs at the kernel level: it fuses operations, plans memory usage, and applies quantization, which is how it reaches low latency and high throughput. The problem was that all of these optimizations previously worked only within a single GPU.
What Multi-Device Inference Provides
The new capability allows TensorRT to automatically distribute a model across multiple GPUs while preserving the full arsenal of optimizations:
- Kernel fusion: merging operations into fewer kernels to minimize overhead, including when data is transferred between devices
- Memory planning: coordinated VRAM management across GPUs without excessive tensor copying
- INT8/FP8 quantization: applied to the computational graph as a whole, not just to individual parts
- Tensor parallelism: automatic distribution of model weights across devices without manual code changes
- Pipeline parallelism: consecutive groups of layers are placed on different cards and work on different batches at the same time, increasing overall throughput
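The idea behind tensor parallelism can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the TensorRT API: the function names are hypothetical, the "devices" are simulated by list slices, and the final concatenation stands in for the all-gather step that real GPUs would perform over an interconnect.

```python
def matmul(x, w):
    """Multiply x (rows x k) by w (k x cols); both are nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_columns(w, num_devices):
    """Split the weight matrix column-wise into equal contiguous shards."""
    cols = len(w[0])
    if cols % num_devices != 0:
        raise ValueError("column count must be divisible by num_devices")
    width = cols // num_devices
    return [[row[d * width:(d + 1) * width] for row in w]
            for d in range(num_devices)]


def tensor_parallel_matmul(x, w, num_devices):
    """Each simulated device multiplies the full input by its own weight shard."""
    shards = split_columns(w, num_devices)
    partials = [matmul(x, shard) for shard in shards]  # one per device
    # Stand-in for all-gather: join the partial outputs along the column axis.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2, 3], [4, 5, 6]]                      # activations, 2 x 3
w = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 1]]  # weights, 3 x 4

# The sharded computation matches the single-device result.
assert tensor_parallel_matmul(x, w, num_devices=2) == matmul(x, w)
```

Each device stores only its slice of the weights, which is what lets a model larger than one GPU's memory run at all. The price is the communication step at the end of each sharded layer.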
Previously, achieving similar results required a complex combination of TensorRT with external tools — TensorRT-LLM or Triton Inference Server — and several weeks of engineering tuning. Now multi-device support is built into the engine itself.
Who Benefits Today
Teams building inference pipelines for media content generation will benefit most from this new capability: text-to-video systems, real-time content adaptation, interactive avatars, multimodal assistants. All these tasks require both large models (meaning lots of memory) and minimal latency (meaning no compromises on optimization).
The new feature also changes the economics of cloud inference. Instead of manually dealing with weight sharding across a GPU cluster and maintaining custom synchronization logic, teams can use the standard TensorRT API — and get the same performance with lower development and maintenance costs.
Particularly noteworthy is the mid-market segment: companies with two to four GPUs but no dedicated ML infrastructure team. For them, removing the barrier to multi-device inference represents the greatest practical shift.
What This Means
Scaling AI inference across multiple devices transitions from "a task for narrow specialists" to "a built-in engine feature." When TensorRT takes control of distribution, the distance between a trained model and a scalable production service shrinks significantly — and this directly impacts which AI products mid-sized teams can afford to launch.