Together AI: How Kernel Optimizations Close the Gap Between Models and GPUs

AI-processed from Together AI Blog; edited by Hamidun News
Source: Together AI Blog. Collage: Hamidun News.

Together AI's kernel optimization team has shown that the gap between theoretical and real-world performance in AI is not inevitable; it is an engineering problem. In one week, they adapted low-level kernels for the new Blackwell GPUs, completing work that NVIDIA spent a whole year on with dozens of specialists.

History of One Revolution

It all started at an unexpected moment. May 2022, Memorial Day in the US. While Silicon Valley was resting, Dan Fu, Tri Dao, and their colleagues published the FlashAttention paper. The main idea sounded provocative: transformer attention optimization was far from finished, contrary to popular belief.

Before that, experts believed that GPUs were already fully utilized. Sparsity (matrix sparsification) and low-rank methods delivered only about 10% gains in practice. The FlashAttention authors took a different path: they didn't look for mathematical magic, but studied how data actually moves through GPU memory.

By applying principles from database management systems (memory locality, cache hierarchy) to attention, they achieved a 2-3x speedup. Andrej Karpathy, then Senior Director of AI at Tesla, tweeted about the paper at 7:00 PM on Monday. By Tuesday morning, it was already circulating across all AI research channels.

"Honestly, we didn't expect anyone to notice," Dan recalls. This moment became the foundation for what is now one of the most influential kernel research teams in AI.
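
To make the memory-locality idea concrete, here is a minimal pure-Python sketch of block-wise attention with a running ("online") softmax for a single query. It is an illustration under simplifying assumptions, not the FlashAttention kernel itself: the function names, the `block_size` parameter, and the list-based data layout are all hypothetical, and a real kernel would keep each block in fast on-chip memory rather than in Python lists. The point it shows is that the result matches ordinary attention even though the full row of scores is never held at once.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference version: compute every score first, then softmax, then mix."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]


def tiled_attention(q, keys, values, block_size=2):
    """Visit keys/values one block at a time, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")    # largest score seen so far
    running_sum = 0.0              # softmax denominator, relative to running_max
    acc = [0.0] * len(values[0])   # unnormalized weighted sum of values

    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [dot(q, k) * scale for k in k_block]

        # Rescale what was accumulated so far to the new running maximum.
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= correction
        acc = [a * correction for a in acc]

        # Fold the current block into the running statistics.
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max

    return [a / running_sum for a in acc]
```

Only one block of keys and values is touched per step, plus a handful of running numbers. That access pattern is what lets a GPU implementation keep its working set in fast memory instead of repeatedly reading and writing a large intermediate matrix.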

The Gap Nobody Saw

Here's what most people miss in AI discussions: having the best models and best hardware isn't enough. The real bottleneck is the gap between them: the software layer that translates mathematical operations into GPU instructions. This is the kernel layer.

Many fundamental architectures (ResNet, LSTM, RNN) were designed before the era of massive scaling. When models grew to hundreds of billions of parameters, GPUs evolved in parallel. Modern chips are essentially specialized matrix multipliers, optimized for dominant transformer architectures.

A kernel is a translation between abstraction and silicon. It's a program that runs on the GPU and specifies how to move data and perform computations efficiently. A good kernel unlocks the full power of the hardware. A bad one leaves it underutilized.
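
As a rough illustration of what "a kernel" means in practice, the sketch below emulates the GPU execution model in plain Python: one small function is run once per thread, and each thread uses its block and thread index to decide which element it owns. This is an emulation for explanation only, not real GPU code. The names `saxpy_kernel` and `launch` are hypothetical, and on actual hardware the threads run in parallel rather than in a loop.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Body executed by ONE thread: out[i] = a * x[i] + y[i] for its own i."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(x):                          # guard threads past the end
        out[i] = a * x[i] + y[i]


def launch(kernel, n, block_dim, *args):
    """Emulate a kernel launch over enough blocks to cover n elements."""
    grid_dim = (n + block_dim - 1) // block_dim  # ceiling division
    for block_idx in range(grid_dim):            # parallel on a real GPU
        for thread_idx in range(block_dim):      # parallel on a real GPU
            kernel(block_idx, thread_idx, block_dim, *args)
    return grid_dim


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0] * 5
out = [0.0] * 5
blocks_used = launch(saxpy_kernel, len(x), 4, 2.0, x, y, out)
```

Even in this toy version the performance-relevant choices are visible: how work is split into blocks, which thread touches which memory location, and how out-of-range threads are handled. Those choices are what a kernel author tunes for each piece of hardware.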

For AI-native applications (products built on AI), this gap is critical:

  • You can't build a responsive AI application on infrastructure that runs below its potential
  • Infrastructure costs skyrocket when kernels are suboptimal
  • Scaling an AI business remains impossible if inference costs twice as much as it should
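
The cost point in the list above is simple arithmetic, sketched here with made-up numbers. The hourly GPU price and the throughput figures are assumptions chosen for illustration only, not figures from Together AI, and `cost_per_million_tokens` is a hypothetical helper.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second):
    """Serving cost per one million generated tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical figures: the same GPU at the same hourly price,
# served by a slower and a faster kernel stack.
slow = cost_per_million_tokens(gpu_hourly_usd=3.60, tokens_per_second=500)
fast = cost_per_million_tokens(gpu_hourly_usd=3.60, tokens_per_second=1000)
print(f"slow kernels: ${slow:.2f} per 1M tokens")
print(f"fast kernels: ${fast:.2f} per 1M tokens")
```

With the hardware price fixed, cost per token is inversely proportional to throughput, so kernels that reach only half the achievable throughput make inference cost twice as much.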

ThunderKittens and Blackwell: A Week Instead of a Year

March 2025. The team had grown to 15 people: a mix of ML researchers who had taken on systems problems and GPU veterans who had moved into AI. Together AI gained access to the new NVIDIA Blackwell GPUs, a generation with a fundamentally different architecture.

The challenge was concrete: NVIDIA spent a year, engaging dozens of engineers, to develop optimized kernels for Blackwell. Together AI set themselves a goal: one week.

The solution was built from what they had developed with Stanford researchers—the ThunderKittens library. Instead of manual coding specific to each new GPU generation, they created a universal framework that scales.

In 5 days, they did work that normally takes a year. It's not just a matter of development speed. It's proof that their kernel methodology truly scales and generalizes to new hardware without starting from scratch.

What This Means

AI-native clouds need AI-native infrastructure, optimized from the silicon up. The gap between models and GPUs isn't closed in scientific papers or at conferences; it's closed in code, in kernels, in how data physically moves through chip memory. The team that understands this and can execute quickly wins in this era.
