ThunderKittens by Together AI: A New Language for Efficient GPU Kernels

Q: What is the source?

Originally published on Together AI Blog. Hamidun News processes and adapts the material with AI.

Q: When was it published?

2026-05-21. Reading time: 3 min.

Hamidun News Editorial

Together AI has released ThunderKittens, a domain-specific language (DSL) for writing optimized GPU kernels. The project aims to simplify the low-level performance work behind neural networks, which currently requires deep knowledge of hardware architecture.

Why This Was Needed

Writing efficient GPU code is black magic for most ML engineers. Chips like the NVIDIA H100 have specialized tensor cores that account for about 94% of the chip's peak compute. But to exploit them in a custom kernel, you have to write CUDA, a complex low-level programming model that relatively few engineers in the industry have mastered.
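
As a rough check on the 94% figure, here is a small back-of-the-envelope calculation. The two peak numbers are assumptions taken from commonly quoted H100 SXM specifications, not from the article: about 989 TFLOPS for dense half-precision matrix multiplication on tensor cores, and about 67 TFLOPS of FP32 on the regular cores.

```python
# Assumed peak figures for an H100 SXM; these are not given in the article.
TENSOR_CORE_TFLOPS = 989.0  # dense half-precision matrix multiply on tensor cores
OTHER_TFLOPS = 67.0         # FP32 on the regular (non-tensor) cores


def tensor_core_share(tensor_tflops: float, other_tflops: float) -> float:
    """Fraction of the combined peak throughput that comes from tensor cores."""
    return tensor_tflops / (tensor_tflops + other_tflops)


share = tensor_core_share(TENSOR_CORE_TFLOPS, OTHER_TFLOPS)
print(f"Tensor cores: {share:.0%} of peak compute")
```

Under these assumed figures the share rounds to 94%, which is why a kernel that leaves the tensor cores idle can only reach a small fraction of the chip's peak.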

In 2023, the popular FlashAttention2 kernel demonstrated that attention in transformers could run significantly faster when the code is carefully tuned to the hardware. But this was a special case: an algorithm for one specific operation. Developers needed a general way to write fast code for many different kernels without dropping down to raw CUDA.

How ThunderKittens Works

ThunderKittens sits between two extremes. On one side is pure CUDA—very fast but very complex, with a high barrier to entry. On the other is Triton, which hides GPU details and simplifies things but sometimes cannot extract maximum performance from the hardware.

ThunderKittens offers a third path. Its API resembles PyTorch, which is familiar to ML developers used to tensor operations. At the same time, it is transparent enough that developers can see what is happening at the hardware level. The authors say: if you know CUDA, you can "compile" ThunderKittens in your head.

The fundamental object in ThunderKittens is a tile: a small matrix sized to match what a tensor core operates on. Working in tile-sized units keeps the specialized cores fully loaded, which is how a kernel gets close to the peak throughput of modern hardware.
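
To make the tile idea concrete, here is a minimal sketch in plain Python of a matrix multiplication broken into tile-sized steps. It is not ThunderKittens code and does not use its API; the tile edge of 16 and the helper names `get_tile`, `tile_mma`, and `matmul_tiled` are assumptions chosen for illustration. The function `tile_mma` stands in for the single multiply-accumulate instruction that a tensor core would execute on one tile.

```python
TILE = 16  # illustrative tile edge; an assumption, not taken from the article


def matmul_naive(a, b):
    """Reference result: plain triple-loop matrix multiplication."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def get_tile(mat, row, col):
    """Copy the TILE x TILE block whose top-left corner is (row, col)."""
    return [r[col:col + TILE] for r in mat[row:row + TILE]]


def tile_mma(acc, a_tile, b_tile):
    """acc += a_tile @ b_tile on one tile; stands in for a tensor-core op."""
    for i in range(TILE):
        for j in range(TILE):
            acc[i][j] += sum(a_tile[i][p] * b_tile[p][j] for p in range(TILE))


def matmul_tiled(a, b):
    """Multiply matrices whose dimensions are all multiples of TILE."""
    n, k, m = len(a), len(b), len(b[0])
    assert n % TILE == 0 and k % TILE == 0 and m % TILE == 0
    out = [[0] * m for _ in range(n)]
    for i0 in range(0, n, TILE):
        for j0 in range(0, m, TILE):
            acc = [[0] * TILE for _ in range(TILE)]  # one accumulator per output tile
            for p0 in range(0, k, TILE):
                tile_mma(acc, get_tile(a, i0, p0), get_tile(b, p0, j0))
            for i in range(TILE):
                out[i0 + i][j0:j0 + TILE] = acc[i]
    return out


# Small integer matrices so the two results can be compared exactly.
a = [[(i * 7 + j * 3) % 5 for j in range(32)] for i in range(32)]
b = [[(i * 2 + j * 5) % 7 for j in range(48)] for i in range(32)]
assert matmul_tiled(a, b) == matmul_naive(a, b)
```

The outer loops only ever move whole tiles around, so all of the arithmetic lands in `tile_mma`. On a GPU, that is the part the tensor cores would perform.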

Performance and Results

On A100 and RTX 4090 GPUs, ThunderKittens matches FlashAttention2 in speed, while the code is somewhat shorter and clearer. On the H100 it pulls ahead, running faster than FlashAttention2 on both the forward and backward passes. In other words, in these comparisons code clarity did not come at the cost of speed.

The authors have already written several kernels in ThunderKittens for other algorithms:

Based—an optimized version of linear attention
Hedgehog and other specialized kernels for transformers
Several solutions that outperform Triton versions in speed
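
For readers unfamiliar with linear attention, the idea behind kernels such as Based can be sketched in a few lines. This is a generic, non-causal illustration, not the Based algorithm or its ThunderKittens kernel; the `elu(x) + 1` feature map and all function names are assumptions chosen for the example. Replacing softmax with a feature map allows the keys and values to be summarized once, so the cost grows linearly with sequence length instead of quadratically.

```python
import math


def feature_map(x):
    """elu(x) + 1, a positive feature map; an assumed choice for illustration."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def attention_quadratic(q, k, v):
    """Compute every query-key score explicitly: O(n^2) in sequence length."""
    fq = [feature_map(r) for r in q]
    fk = [feature_map(r) for r in k]
    out = []
    for qi in fq:
        scores = [sum(x * y for x, y in zip(qi, kj)) for kj in fk]
        total = sum(scores)
        out.append([sum(s * vj[c] for s, vj in zip(scores, v)) / total
                    for c in range(len(v[0]))])
    return out


def attention_linear(q, k, v):
    """Summarize keys and values once, then reuse the summary: O(n)."""
    fq = [feature_map(r) for r in q]
    fk = [feature_map(r) for r in k]
    d, dv = len(fk[0]), len(v[0])
    # kv[a][c] = sum over positions of fk[pos][a] * v[pos][c]
    kv = [[sum(kj[a] * vj[c] for kj, vj in zip(fk, v)) for c in range(dv)]
          for a in range(d)]
    ksum = [sum(kj[a] for kj in fk) for a in range(d)]
    out = []
    for qi in fq:
        norm = sum(qi[a] * ksum[a] for a in range(d))
        out.append([sum(qi[a] * kv[a][c] for a in range(d)) / norm
                    for c in range(dv)])
    return out


# Deterministic toy inputs: 6 positions, 4 query/key features, 3 value features.
q = [[math.sin(i + j) for j in range(4)] for i in range(6)]
k = [[math.cos(i * j) for j in range(4)] for i in range(6)]
v = [[float(i - c) for c in range(3)] for i in range(6)]

slow = attention_quadratic(q, k, v)
fast = attention_linear(q, k, v)
max_diff = max(abs(x - y) for rs, rf in zip(slow, fast) for x, y in zip(rs, rf))
print(f"largest difference between the two orderings: {max_diff:.2e}")
```

Both functions compute the same result up to floating-point rounding. The difference is only the order of the multiplications, and the inner sums in `attention_linear` are small matrix products of the kind that map naturally onto tiles.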

An interesting point: engineers who took just a two-hour CUDA course were able to write their own ThunderKittens code. This suggests that the language truly simplifies development while not hiding hardware details.

Open Project and Education

The authors from Together AI are candid: this is an art project. Don't expect regular updates and support for every complaint in the issue tracker. The project is released as open-source because the developers believe it is worthwhile to share ideas and tools with the community.

Along with ThunderKittens, they released NanoGPT-TK—a version of Andrej Karpathy's iconic NanoGPT project where core computational kernels are rewritten in ThunderKittens. This was done specifically for education and demonstration purposes. NanoGPT has long been recognized as one of the best AI projects for understanding how transformer training works from scratch.

What This Means

ThunderKittens shows that in AI there is a real gap between the convenience of abstractions (PyTorch, Triton) and control over real hardware (CUDA). It also suggests that developers are willing to write somewhat more complex code if it gives them that control and measurable speed.

For ML engineers, this could mean that in the future, porting a trained model from one chip to another will be easier—it will suffice to rewrite a few GPU kernels rather than redo half the infrastructure. For researchers, it is a tool for rapid experimentation with specialized algorithms that does not require a month-long CUDA course.
