NVIDIA TensorRT now scales generative AI inference across multiple GPUs


Hamidun News Editorial

AI monitoring · NVIDIA Developer Blog

Jun 29, 2026 · 2 min

AI-processed from NVIDIA Developer Blog; edited by Hamidun News

NVIDIA TensorRT now scales generative AI inference across multiple GPUs — Source: NVIDIA Developer Blog. Collage: Hamidun News.


NVIDIA has updated TensorRT, adding native support for inference across multiple GPUs at once — large generative models can now run in production without manual sharding and without losing the engine's key optimizations.

Why a Single GPU Is No Longer Enough

Modern generative models grow faster than GPU memory capacity expands. Diffusion networks for video generation, multimodal LLMs with extended context, and complex media content pipelines have long since outgrown 80 GB, the memory capacity of the flagship H100. Developers of inference systems faced a stark choice: either manually split the computational graph and lose TensorRT optimizations, or switch to third-party frameworks with lower throughput.
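To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The 70-billion-parameter model is a hypothetical example, not a figure from NVIDIA, and the calculation counts weights only; activations and caches need additional memory on top.

```python
import math

H100_MEMORY_GB = 80  # capacity cited in the article


def weight_footprint_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params, bytes_per_param, gpu_memory_gb=H100_MEMORY_GB):
    """Lower bound on GPU count: weights only, no activations or caches."""
    return math.ceil(weight_footprint_gb(num_params, bytes_per_param) / gpu_memory_gb)


# Hypothetical 70-billion-parameter model at two precisions.
HYPOTHETICAL_PARAMS = 70e9
for name, nbytes in [("FP16", 2), ("FP8/INT8", 1)]:
    gb = weight_footprint_gb(HYPOTHETICAL_PARAMS, nbytes)
    gpus = min_gpus_for_weights(HYPOTHETICAL_PARAMS, nbytes)
    print(f"{name}: {gb:.0f} GB of weights, at least {gpus} GPU(s)")
```

Under these assumptions the FP16 weights alone already exceed one 80 GB card, which is the situation the article describes.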

TensorRT is the de facto standard for production deployment on NVIDIA hardware. The engine optimizes computational graphs at the kernel level: it fuses operations, plans memory usage, and applies quantization, and in doing so delivers the lowest latency and highest throughput among available options. The problem was that all of these optimizations previously worked only within a single GPU.
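As an illustration of what quantization means in general, the following is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is not TensorRT's calibration procedure or API; the function names and sample weights are made up for the example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


# Illustrative weights, not taken from any real model.
weights = [0.02, -1.27, 0.635, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding to the nearest integer code bounds the error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored in one byte instead of two or four, at the cost of a rounding error of at most half a quantization step.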

What Multi-Device Inference Provides

The new capability allows TensorRT to automatically distribute a model across multiple GPUs while preserving the full arsenal of optimizations:

Kernel fusion — merging operations to minimize overhead when transferring data between devices
Memory planning — intelligent VRAM management across GPUs without excessive tensor copying
INT8/FP8 quantization — applied to the entire computational graph as a whole, not just individual parts
Tensor parallelism — automatic distribution of model weights across devices without manual code changes
Pipeline parallelism — different network layers run in parallel on different cards, increasing overall throughput
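The two parallelism schemes in the list above can be sketched conceptually as follows. This is a pure-Python simulation with hypothetical helper names, assuming simple matrix-vector layers; it does not use or represent the TensorRT API, and the "devices" are just list slices.

```python
def matvec(weight_rows, x):
    """y = W @ x, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


def split_contiguous(items, num_parts):
    """Split a list into up to num_parts contiguous chunks."""
    per_part = -(-len(items) // num_parts)  # ceiling division
    return [items[i:i + per_part] for i in range(0, len(items), per_part)]


def tensor_parallel_matvec(weight_rows, x, num_devices):
    """Tensor parallelism: each simulated device owns a slice of W's output rows."""
    shards = split_contiguous(weight_rows, num_devices)
    partial_outputs = [matvec(shard, x) for shard in shards]  # one per device
    return [y for part in partial_outputs for y in part]      # gather by concatenation


def pipeline_forward(layers, x, num_stages):
    """Pipeline parallelism: consecutive layers are grouped into stages."""
    for stage in split_contiguous(layers, num_stages):  # one stage per device
        for weight_rows in stage:
            x = matvec(weight_rows, x)                  # activations flow stage to stage
    return x


W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
print(tensor_parallel_matvec(W, x, num_devices=2) == matvec(W, x))

layer_a = [[1, 0], [0, 2]]
layer_b = [[1, 1], [1, -1]]
print(pipeline_forward([layer_a, layer_b], x, num_stages=2))
```

The sketch only shows how the work is partitioned and that the results match the single-device computation. In a real pipeline the throughput gain comes from different stages processing different micro-batches at the same time, which this simulation does not model.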

Previously, achieving similar results required a complex combination of TensorRT with external tools — TensorRT-LLM or Triton Inference Server — and several weeks of engineering tuning. Now multi-device support is built into the engine itself.

Who Benefits Today

The teams that benefit most are those building inference pipelines for media content generation, such as text-to-video systems, real-time content adaptation, interactive avatars, and multimodal assistants. All of these tasks require both large models (and therefore a lot of memory) and minimal latency (and therefore no compromises on optimization).

The new feature also changes the economics of cloud inference. Instead of manually sharding weights across a GPU cluster and maintaining custom synchronization logic, teams can use the standard TensorRT API and get the same performance with lower development and maintenance costs.

Particularly noteworthy is the mid-market segment: companies with two to four GPUs but no dedicated ML infrastructure team. For them, removing the barrier to multi-device inference represents the greatest practical shift.

What This Means

Scaling AI inference across multiple devices transitions from "a task for narrow specialists" to "a built-in engine feature." When TensorRT takes control of distribution, the distance between a trained model and a scalable production service shrinks significantly — and this directly impacts which AI products mid-sized teams can afford to launch.
