
Together AI achieves 90% faster training on NVIDIA Blackwell

Together AI shared results on NVIDIA Blackwell: Llama 70B training ran roughly 90% faster than on H100, at 15,264 tokens/sec versus 8,080, thanks to its custom optimized kernels.

Source: Together AI Blog. Collage: Hamidun News.

Together AI announced immediate access to GPU clusters accelerated by NVIDIA Blackwell and presented its own optimization stack, tailored to the new GPU architecture.

Results: 90% acceleration versus H100

In tests training a Llama model with 70 billion parameters, the Together AI team reached 15,264 tokens per second on a single GPU. That is nearly double the previous-generation NVIDIA HGX H100, which processed 8,080 tokens per second in an optimized configuration.
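The headline figure can be checked directly from the two throughput numbers. The short sketch below only redoes the arithmetic; the function and constant names are illustrative, not taken from Together AI. The ratio comes out at about 1.89x, which the headline rounds to 90%.

```python
# Throughput figures quoted in the article (tokens per second).
B200_TOKENS_PER_SEC = 15_264
H100_TOKENS_PER_SEC = 8_080


def speedup(new: float, old: float) -> float:
    """Return how many times faster `new` is than `old`."""
    return new / old


ratio = speedup(B200_TOKENS_PER_SEC, H100_TOKENS_PER_SEC)
percent_faster = (ratio - 1.0) * 100.0

print(f"speedup: {ratio:.2f}x, or {percent_faster:.0f}% faster")
```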

The results were achieved with an optimized version of TorchTitan combined with the Together Kernel Collection, the company's own set of optimized kernels. The runs used BF16 precision (bfloat16, a 16-bit format that trades some accuracy for speed and is now the standard for training large models). According to the company, additional optimizations still in development should push the speed higher.
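To make the BF16 trade-off concrete: bfloat16 keeps the sign bit and all 8 exponent bits of a 32-bit float but only 7 of its 23 mantissa bits, so it covers the same range with much less precision. The sketch below imitates that by keeping the top 16 bits of a float32. It is an illustration only: it truncates for simplicity, whereas real hardware normally rounds to nearest, and the helper names are made up for this example.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 bit pattern of x after rounding to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bf16_truncate(x: float) -> float:
    """Approximate x as bfloat16 by keeping the top 16 bits of its float32 form.

    Assumption: plain truncation, chosen for clarity; hardware conversions
    usually round to nearest-even instead.
    """
    bits = float32_bits(x) & 0xFFFF0000  # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Precision is lost (pi keeps only 2-3 significant digits), but very small
# and very large magnitudes still fit, because the exponent field is unchanged.
for value in (3.14159265, 1e-30, 3.0e38):
    print(value, "->", to_bf16_truncate(value))
```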

How it works: optimization at the architecture level

The speedup comes from deep optimization tailored to the specific GPU architecture. Together AI developed several components designed to make full use of NVIDIA Blackwell's capabilities:

  • Custom FP8 kernels for fifth-generation NVIDIA Tensor Cores (high-performance compute units)
  • Attention kernels running 1.8x faster than FlashAttention-3 (the current standard for optimized attention)
  • Integration with the open ThunderKittens library for full utilization of dedicated on-chip memory
  • Distributed training algorithms adapted to Quantum-2 InfiniBand network topology

Tri Dao, chief scientist at Together AI and creator of FlashAttention, noted: "We optimize every level of the AI stack to fully leverage GPU architecture advances. We especially like the new Tensor Cores and microscaling format for inference acceleration. The combination of Together Kernel Collection with NVIDIA Blackwell redefines the standards for efficient training and inference at scale."

Testing program and scaling

As part of an exclusive launch program, Together AI is offering eight pioneering AI companies direct access to dedicated HGX B200 nodes, along with the opportunity to collaborate with NVIDIA engineers and Together AI researchers. The goal is to jointly accelerate workloads and find further optimizations.

In parallel, the company is deploying tens of thousands of HGX B200 servers and complete GB200 NVL72 solutions with NVIDIA Quantum-2 InfiniBand networks. This includes the previously announced cluster of 36,000+ GPUs for training next-generation models and agents.

What this means

For AI companies the result is practical: training large models becomes cheaper and faster. At a 90% speed gain, a run takes a little over half the time, so a job that previously needed four weeks finishes in just over two. This reduces compute spending and shortens the experimentation cycle for new architectures. For the market overall, it is a signal that the era of generic GPU services is ending. AI companies that write their own optimized kernels for specific architectures (as Together AI does with its Together Kernel Collection) gain a competitive edge in speed and cost. That directly affects the price of training and, ultimately, the price of AI services for end users.
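As a rough illustration of what the throughput gain means in wall-clock terms, the sketch below converts tokens per second into training days. The token budget and GPU count are hypothetical, and the calculation assumes the quoted per-GPU throughput scales linearly across GPUs, which real clusters only approximate.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_tokens: float, tokens_per_sec_per_gpu: float, num_gpus: int) -> float:
    """Estimate wall-clock days to process total_tokens.

    Assumption: perfect linear scaling across GPUs (no communication overhead).
    """
    cluster_tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
    return total_tokens / cluster_tokens_per_sec / SECONDS_PER_DAY


# Hypothetical workload, chosen only for illustration.
TOKEN_BUDGET = 1e13  # tokens to train on
NUM_GPUS = 512

h100_days = training_days(TOKEN_BUDGET, 8_080, NUM_GPUS)   # throughput from the article
b200_days = training_days(TOKEN_BUDGET, 15_264, NUM_GPUS)  # throughput from the article

print(f"H100: {h100_days:.1f} days, B200: {b200_days:.1f} days")
```

Under these assumptions the run shrinks from about four weeks to about two, in line with the roughly 1.9x throughput ratio.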

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.