Together AI: GPT-5.5, Gemini and Opus cannot write fast multi-GPU kernels

Together AI released ParallelKernelBench, a benchmark of 87 CUDA kernel generation tasks for multi-GPU systems drawn from real codebases. The best models…

Hamidun News Editorial

AI monitoring · Together AI Blog

Jun 30, 2026· 3 min

AI-processed from Together AI Blog; edited by Hamidun News

Together AI has published ParallelKernelBench (PKB), the first open benchmark for evaluating how well language models generate efficient CUDA kernels for multi-GPU environments. After testing more than 40 models on 87 real-world tasks, the researchers found that the best frontier models solve fewer than a third of the tasks correctly, and only a handful of those solutions actually outperform a naive PyTorch implementation.

Why Multi-GPU Is Harder

Language models have become reasonably good at writing code for a single GPU, and most existing GPU programming benchmarks cover only that scenario. Real production AI systems moved past that limit long ago: they run on dozens or hundreds of GPUs at once. In such configurations, compute is no longer the main bottleneck; communication between devices is. According to Together AI, data transfer between GPUs accounts for more than 20% of inference latency, and that share will grow because chip compute performance keeps outpacing inter-chip link bandwidth. Multi-GPU code generation differs fundamentally from single-GPU code generation for three reasons:

Combinatorial explosion of options: you must choose among tensor, context, expert, data, and other types of parallelism, each of which creates its own communication pattern
Different performance model: instead of a local compute roofline, the main constraint is inter-chip link bandwidth
New architectural choices: how to physically move data between GPUs (through the copy engine, TMA, SM load/store, or NVLS) and whether to overlap data transfer with computation
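
The change in performance model can be illustrated with a back-of-envelope estimate. The sketch below compares a compute-roofline lower bound with the bandwidth-only lower bound for a ring all-reduce, where each GPU moves 2(n-1)/n of the payload. The function names and every hardware number are hypothetical placeholders chosen for illustration; none of them come from Together AI.

```python
def compute_time_s(flops: float, peak_flops_per_s: float) -> float:
    """Lower bound on compute time from a compute roofline."""
    return flops / peak_flops_per_s


def ring_allreduce_time_s(payload_bytes: float, num_gpus: int,
                          link_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound for a ring all-reduce.

    Each GPU sends and receives 2 * (n - 1) / n of the payload.
    Latency terms are ignored.
    """
    if num_gpus < 2:
        return 0.0
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / link_bytes_per_s


# Hypothetical numbers, for illustration only.
flops = 2e12      # work per step on one GPU
peak = 1e15       # peak compute, FLOP/s
payload = 1e9     # bytes to all-reduce
link_bw = 4e11    # per-GPU link bandwidth, bytes/s
gpus = 8

t_compute = compute_time_s(flops, peak)
t_comm = ring_allreduce_time_s(payload, gpus, link_bw)
# Share of step time spent communicating if nothing is overlapped.
comm_share = t_comm / (t_compute + t_comm)
print(f"compute {t_compute * 1e3:.2f} ms, comm {t_comm * 1e3:.2f} ms, "
      f"comm share {comm_share:.0%} without overlap")
```

With these placeholder values the communication term is the larger of the two, which is why overlapping transfer with computation appears among the design choices listed above.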

How the Benchmark Works

PKB includes 87 tasks drawn from real codebases (Megatron-LM, DeepSpeed, DeepEP, TensorRT-LLM, and NeMo-RL) as well as non-standard distributed workloads: routing in graph neural networks, distributed FFT, and Gaussian splatting. The selection covers all major sharding approaches: tensor, context, data, expert, sequence, and FSDP/ZeRO. Each task starts from a standard PyTorch + NCCL implementation and a description of the hardware topology. The model must replace that implementation with its own CUDA kernel that moves data directly between GPUs over NVLink through symmetric memory, bypassing the standard collective operations stack. Solutions are evaluated on three criteria: correctness of the result, wall-clock speedup, and how close the kernel comes to the communication roofline, the theoretical limit set by link bandwidth.
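
The three criteria can be sketched as a small scoring function. This is a minimal illustration, not PKB's actual harness: the names `KernelScore`, `outputs_match`, and `score_kernel`, the tolerances, and the sample timings are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class KernelScore:
    correct: bool
    speedup: float            # baseline time / candidate time
    roofline_fraction: float  # achieved bandwidth / peak link bandwidth


def outputs_match(candidate, reference, rel_tol=1e-3, abs_tol=1e-5) -> bool:
    """Element-wise comparison of two flat sequences of floats."""
    return len(candidate) == len(reference) and all(
        math.isclose(c, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for c, r in zip(candidate, reference)
    )


def score_kernel(candidate_out, reference_out,
                 baseline_s, candidate_s,
                 bytes_moved, peak_link_bytes_per_s) -> KernelScore:
    """Combine correctness, wall-clock speedup, and roofline fraction."""
    correct = outputs_match(candidate_out, reference_out)
    speedup = baseline_s / candidate_s
    achieved_bw = bytes_moved / candidate_s
    return KernelScore(correct, speedup, achieved_bw / peak_link_bytes_per_s)


# Hypothetical measurements for one task.
score = score_kernel(
    candidate_out=[1.0, 2.0, 3.0],
    reference_out=[1.0, 2.0, 3.0000001],
    baseline_s=0.010,
    candidate_s=0.008,
    bytes_moved=2e9,
    peak_link_bytes_per_s=4e11,
)
print(score)
```

Keeping the roofline fraction separate from the speedup matters because a kernel can beat the baseline and still leave most of the link bandwidth unused.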

Results and Unexpected Victories

Together AI tested more than 40 models, including GPT-5.5, Gemini 3 Pro, and Opus 4.7, the current flagship models from OpenAI, Google, and Anthropic. The results were disappointing across the board:

The best model correctly solved fewer than a third of the 87 tasks
Fewer than a quarter of the correct solutions outperformed the naive PyTorch + NCCL baseline
Most failures came from managing communication between GPUs and choosing the correct data transfer method
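
The fractions above translate into small absolute counts. The sketch below works out the upper bounds, assuming the "quarter" applies to the best model's own correct solutions; the article does not say whether the figure is per model or aggregated, so treat that as an assumption. The helper name is hypothetical.

```python
TOTAL_TASKS = 87  # task count reported in the article


def max_count_strictly_below(total: int, denominator: int) -> int:
    """Largest integer k with k < total / denominator (integer arithmetic)."""
    return (total - 1) // denominator


# "Fewer than a third of 87 tasks" solved correctly.
max_correct = max_count_strictly_below(TOTAL_TASKS, 3)

# "Fewer than a quarter" of those beat the baseline (assumed per-model).
max_faster = max_count_strictly_below(max_correct, 4)

print(f"at most {max_correct} correct, at most {max_faster} faster than baseline")
```

Under that reading, the best model has at most 28 correct solutions and at most 6 that beat the baseline, which matches the "handful" described earlier.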

At the same time, several solutions proved unexpectedly strong: individual generated kernels surpassed all publicly available implementations. GRPO training in NVIDIA NeMo-RL is a particularly telling case: no optimized public code existed for this operation, and a language model wrote it before humans did.

"Several generated kernels turned out to be faster than anything available publicly," according to the ParallelKernelBench technical report.

What This Means

PKB marks the next frontier for AI coding: the move from a single GPU to distributed multi-GPU systems. For now, frontier models cannot handle it, but the rare successes suggest that progress is possible with focused collection of specialized training data. For teams optimizing inference and training on GPU clusters, the benchmark is an important signal: LLM kernel generation is maturing but is not yet ready for broad adoption.

Hamidun News

AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
