
FlashAttention-4: how Together AI accelerated attention on Blackwell GPUs

Source: Together AI Blog. Collage: Hamidun News.

FlashAttention-4 rebuilt the attention kernel specifically for Blackwell. The speedup comes from using the new tensor memory (TMEM) and the 2-CTA MMA mode, which target the actual bottleneck: not matrix-multiplication speed, but the special function units (SFUs) that evaluate the softmax exponentials, and memory traffic. Result: 1605 TFLOPs/s (71% utilization), which is 1.3× faster than cuDNN and 2.7× faster than Triton.
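
To put the headline numbers in context, here is a back-of-the-envelope sketch in Python. It only rearranges the figures quoted above. The function names are hypothetical, the head dimension of 128 is an illustrative assumption not taken from the source, and the baseline estimates assume the 1.3× and 2.7× speedups were measured on the same workload as the 1605 TFLOPs/s result, which the summary does not confirm.

```python
# Figures quoted in the summary above.
ACHIEVED_TFLOPS = 1605.0
UTILIZATION = 0.71
SPEEDUP_VS_CUDNN = 1.3
SPEEDUP_VS_TRITON = 2.7


def implied_peak(achieved: float, utilization: float) -> float:
    """Peak throughput implied by an achieved rate and a utilization fraction."""
    return achieved / utilization


def implied_baseline(achieved: float, speedup: float) -> float:
    """Baseline throughput implied by a speedup factor.

    Assumption: the speedup refers to the same workload as `achieved`.
    """
    return achieved / speedup


def matmul_flops_per_exp(head_dim: int) -> int:
    """Matmul FLOPs per softmax exponential in the attention forward pass.

    Each attention score costs 2*d FLOPs in Q @ K^T and 2*d FLOPs in P @ V,
    and needs exactly one exponential in the softmax.
    """
    return 4 * head_dim


if __name__ == "__main__":
    print(f"implied peak:    {implied_peak(ACHIEVED_TFLOPS, UTILIZATION):.0f} TFLOPs/s")
    print(f"implied cuDNN:   {implied_baseline(ACHIEVED_TFLOPS, SPEEDUP_VS_CUDNN):.0f} TFLOPs/s")
    print(f"implied Triton:  {implied_baseline(ACHIEVED_TFLOPS, SPEEDUP_VS_TRITON):.0f} TFLOPs/s")
    # head_dim = 128 is an illustrative assumption, not a figure from the source.
    print(f"matmul FLOPs per exp (d=128): {matmul_flops_per_exp(128)}")
```

The last function shows why the exponential can become the limiting step: the matmul work per exponential is fixed at 4·d, so once the matrix units get fast enough, the kernel ends up waiting on the units that compute the exponentials.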

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.