FlashAttention-4: how Together AI accelerated attention on Blackwell GPUs

FlashAttention-4 rebuilds the attention kernel specifically for Blackwell. The speedup comes from the new tensor memory (TMEM) and the 2-CTA MMA mode, which address the actual bottleneck: not the speed of the matrix operations, but the special function unit (SFU) that computes the softmax exponentials, and memory traffic. Result: 1605 TFLOPs/s (71% utilization), 1.3× faster than cuDNN and 2.7× faster than Triton.
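
To make the bottleneck argument concrete, here is a minimal back-of-the-envelope sketch. It counts the matrix-multiply FLOPs and the softmax exponentials in one forward attention head, and derives the peak throughput implied by the reported 1605 TFLOPs/s at 71% utilization. The function names and the `seq_len=4096`, `head_dim=128` values are illustrative assumptions, not figures from the article.

```python
def attention_forward_counts(seq_len: int, head_dim: int) -> tuple[int, int]:
    """Return (matmul_flops, exp_calls) for one non-causal attention head.

    Hypothetical helper for illustration only.
    """
    # Q @ K^T and P @ V each cost 2 * seq_len^2 * head_dim FLOPs.
    matmul_flops = 4 * seq_len * seq_len * head_dim
    # Softmax evaluates one exponential per attention score.
    exp_calls = seq_len * seq_len
    return matmul_flops, exp_calls


def implied_peak_tflops(achieved_tflops: float, utilization: float) -> float:
    """Peak throughput implied by an achieved throughput and a utilization."""
    return achieved_tflops / utilization


if __name__ == "__main__":
    # Assumed problem size, chosen only to illustrate the ratio.
    flops, exps = attention_forward_counts(seq_len=4096, head_dim=128)
    print("matmul FLOPs per exponential:", flops // exps)  # 4 * head_dim = 512
    # Figures taken from the summary above: 1605 TFLOPs/s at 71% utilization.
    print("implied peak TFLOPs/s:", round(implied_peak_tflops(1605.0, 0.71)))
```

The ratio of matmul FLOPs to exponentials is fixed at `4 * head_dim`, independent of sequence length. If tensor-core throughput grows from one GPU generation to the next while SFU throughput does not keep pace, the exponentials take a growing share of the kernel's runtime, which is the imbalance the summary describes.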