FlashAttention-4: How Together AI Accelerated Attention on Blackwell GPUs
AI-processed from Together AI Blog; edited by Hamidun News
FlashAttention-4 is a redesign of the attention algorithm specifically for modern GPUs, where tensor cores grow in performance significantly faster than memory and other resources.
Why the Standard Approach No Longer Works
At first glance, attention performance is governed by the speed of two matrix multiplications: S = Q × K^T and O = P × V, where P is the softmax of S. But analysis of the Blackwell B200 shows something unexpected: the bottleneck lies not in the tensor cores but in the special function units (SFUs) that compute the exponentials in softmax (forward pass) and in shared memory traffic (backward pass).
From Hopper (H100) to Blackwell (B200), BF16 tensor core throughput grew from 1 to 2.25 petaflops, while the number of SFUs and the shared memory bandwidth stayed the same. This asymmetry breaks the standard optimization playbook: you can no longer assume that the tensor cores alone determine performance. In practice, the units that did not scale now hold the tensor cores back.
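To make the two matrix multiplications and the exponential step concrete, here is a minimal reference sketch in plain Python. It is not the FlashAttention-4 kernel. It only mirrors the formulas in the paragraph above (without the usual 1/sqrt(d) scaling, which the article's formulas omit) and counts how much matmul work and how many exponentials a call performs. The function name and the counters are illustrative assumptions.

```python
import math

def attention(Q, K, V):
    """Naive attention on lists of lists; returns (O, counts).

    Illustrative reference only: S = Q x K^T, P = softmax(S), O = P x V.
    """
    n_k, d, d_v = len(K), len(Q[0]), len(V[0])
    counts = {"matmul_flops": 0, "exp_calls": 0}
    O = []
    for q in Q:
        # S = Q x K^T: one row of scores (tensor core work on a GPU)
        s = [sum(q[c] * k[c] for c in range(d)) for k in K]
        counts["matmul_flops"] += 2 * n_k * d
        # Softmax: one exponential per score (SFU work on a GPU)
        m = max(s)
        e = [math.exp(x - m) for x in s]
        counts["exp_calls"] += n_k
        total = sum(e)
        p = [x / total for x in e]
        # O = P x V (tensor core work again)
        O.append([sum(p[j] * V[j][c] for j in range(n_k)) for c in range(d_v)])
        counts["matmul_flops"] += 2 * n_k * d_v
    return O, counts

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
O, counts = attention(Q, K, V)
```

Every score needs exactly one exponential, so the exponential count grows with the number of query-key pairs regardless of how fast the matrix multiplications run.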
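The shift in bottleneck can be illustrated with a toy model. The sketch below assumes the units overlap perfectly, so the time per tile is set by the slowest unit. The work split between tensor cores and SFUs (1.0 versus 0.6) is a made-up, normalized figure chosen only for illustration; only the 2.25× tensor core ratio comes from the article.

```python
def tile_time(work, throughput):
    """Return per-unit times and the bottleneck unit, assuming perfect overlap."""
    times = {unit: work[unit] / throughput[unit] for unit in work}
    bottleneck = max(times, key=times.get)
    return times, bottleneck

# Hypothetical normalized work per attention tile (assumption, not measured).
work = {"tensor_core": 1.0, "sfu_exp": 0.6}

# Throughput relative to Hopper: tensor cores scale 2.25x, SFUs do not.
hopper = {"tensor_core": 1.0, "sfu_exp": 1.0}
blackwell = {"tensor_core": 2.25, "sfu_exp": 1.0}

t_h, b_h = tile_time(work, hopper)      # tensor cores are the slowest unit
t_b, b_b = tile_time(work, blackwell)   # the exponentials are now the slowest unit
speedup = max(t_h.values()) / max(t_b.values())
```

Under these assumed numbers the end-to-end speedup is about 1.67× rather than 2.25×, because the exponential unit becomes the limiting stage once the matrix multiplications get faster.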
How Together AI Solved the Problem
The research team, together with engineers from NVIDIA, Meta, and Princeton, proposed three key ideas:
- New pipelining—software pipelines that maximally overlap tensor core, SFU, and memory operations without stalls.
- Polynomial approximation of exponentials—instead of using slow SFU units, the forward pass computes exponentials on fast FMA units (fused multiply-add).
- TMEM and 2-CTA MMA—leveraging new tensor memory (256 KB per SM) plus a mode where two thread blocks work on a single matrix operation, reducing shared memory traffic.
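The second idea, computing exponentials with multiply-adds instead of a dedicated unit, can be sketched in plain Python. This is an illustration under stated assumptions, not the kernel's actual code: it uses Taylor coefficients for simplicity, whereas a production kernel would likely use tuned coefficients and a lower degree, and the function names are hypothetical. The input is split into an integer part, handled by an exact power-of-two scaling, and a small fractional part, handled by a polynomial evaluated with Horner's rule, where each step maps to one fused multiply-add.

```python
import math

LN2 = math.log(2.0)

def exp2_poly(x, degree=5):
    """Approximate 2**x using range reduction plus a polynomial (illustrative)."""
    n = math.floor(x + 0.5)   # nearest integer part
    f = x - n                 # fractional part in [-0.5, 0.5)
    # Taylor coefficients of 2**f = exp(f * ln 2): c_k = (ln 2)**k / k!
    coeffs = [LN2 ** k / math.factorial(k) for k in range(degree + 1)]
    acc = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        acc = acc * f + c     # Horner step: one FMA on hardware
    return math.ldexp(acc, n) # multiply by 2**n exactly

def exp_poly(x, degree=5):
    """Approximate e**x via 2**(x * log2(e))."""
    return exp2_poly(x * math.log2(math.e), degree)

xs = [i / 100.0 for i in range(-2000, 501)]
max_rel_err = max(abs(exp_poly(x) - math.exp(x)) / math.exp(x) for x in xs)
```

With degree 5 the relative error over this range stays below 1e-5, which gives a sense of why a short polynomial can be enough for BF16 attention, where the results are later rounded to low precision anyway.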
Blackwell Capabilities That Made This Possible
Blackwell specifically added several features to enable such optimizations:
Tensor memory (TMEM): fast on-chip storage (256 KB per SM) wired directly to the tensor cores. Intermediate results can stay in TMEM instead of going through slower shared memory, which reduces access latency.
Asynchronous 5th-generation tensor cores: each matrix multiply-accumulate (MMA) operation is issued by a single thread and accumulates its results in TMEM. The maximum BF16 tile is 128×256×16 (roughly 2× larger than on Hopper), which enables deeper pipelining without running out of registers.
2-CTA MMA: a new mode in which two thread blocks (CTAs, cooperative thread arrays) work together on a single matrix operation. This halves shared memory traffic and reduces the number of atomic operations.
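A little arithmetic shows how these numbers fit together. The sketch below assumes FP32 accumulators (4 bytes per element) and BF16 operands (2 bytes per element). It also uses a simplified model of the 2-CTA mode, in which the shared operand tile is split evenly across the pair of thread blocks. Both the model and the helper names are assumptions made for illustration, not details taken from the article.

```python
KIB = 1024
TMEM_BYTES = 256 * KIB  # per SM, as stated in the article

def accumulator_bytes(m, n, bytes_per_elem=4):
    """Size of an m x n accumulator tile (FP32 assumed)."""
    return m * n * bytes_per_elem

def operand_b_bytes_per_cta(n, k, ctas=1, bytes_per_elem=2):
    """Shared memory each CTA stages for an n x k operand tile.

    Simplified assumption: in 2-CTA mode the tile is split across the pair.
    """
    return n * k * bytes_per_elem // ctas

acc = accumulator_bytes(128, 256)                  # 128 KiB for a 128x256 tile
tiles_in_tmem = TMEM_BYTES // acc                  # 2 such accumulators fit
one_cta = operand_b_bytes_per_cta(256, 16, ctas=1) # 8192 bytes
two_cta = operand_b_bytes_per_cta(256, 16, ctas=2) # 4096 bytes
```

Under these assumptions, one maximum-size accumulator fills half of TMEM, which leaves room for a second tile to be in flight, and pairing two thread blocks halves the bytes each one has to stage for the shared operand.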
Numbers and Results
FlashAttention-4 on the Blackwell B200 with BF16 reaches 1605 TFLOPs/s, or 71% of the 2.25-petaflop BF16 peak. This is 1.3× faster than cuDNN 9.13 and 2.7× faster than Triton. Sustaining more than two-thirds of peak throughput is notable for a kernel as complex as attention, which interleaves matrix multiplications with softmax.
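The utilization figure can be checked against the 2.25-petaflop BF16 peak quoted earlier in the article. The short sketch below does that arithmetic. It also derives the throughput the two baselines would have if the quoted speedups were measured on the same configuration. That last step is an assumption, and the derived baseline numbers are estimates, not reported results.

```python
def utilization(achieved_tflops, peak_tflops):
    """Fraction of peak throughput actually achieved."""
    return achieved_tflops / peak_tflops

def implied_baseline(achieved_tflops, speedup):
    """Baseline throughput implied by a quoted speedup (same-config assumption)."""
    return achieved_tflops / speedup

FA4_TFLOPS = 1605.0
B200_BF16_PEAK_TFLOPS = 2250.0  # 2.25 petaflops

util = utilization(FA4_TFLOPS, B200_BF16_PEAK_TFLOPS)  # about 0.713
cudnn_est = implied_baseline(FA4_TFLOPS, 1.3)          # roughly 1235 TFLOPs/s
triton_est = implied_baseline(FA4_TFLOPS, 2.7)         # roughly 594 TFLOPs/s
```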
What This Means
FlashAttention-4 shows how to work in the era of asymmetric GPU scaling: not by tweaking old algorithms, but by redesigning them around new hardware capabilities. The result is a significant achievement, but it is only the beginning of adapting to the new hardware reality.