FlashAttention-3 Speeds Up Transformer Attention up to 2x, Reaching 75% GPU Utilization
Together AI unveiled FlashAttention-3, a new algorithm for accelerating attention in large language models. It runs up to twice as fast as FlashAttention-2.
AI-processed from Together AI Blog; edited by Hamidun News
Together AI, NVIDIA, and Meta released FlashAttention-3, an improved algorithm for accelerating attention in transformers. The new version reaches up to 75% of the theoretical peak throughput of the NVIDIA H100, up from about 35% for FlashAttention-2, and runs 1.5-2x faster on that hardware. This matters: as cloud computing becomes more expensive and LLMs demand ever more resources, every percentage point of efficiency translates into money.
Why It Was a Bottleneck
Attention is the heart of transformers, but it is also the most expensive part of the computation. With standard attention, time and memory grow quadratically with sequence length: double the context and the attention score matrix needs four times as much memory. FlashAttention addressed this in 2022 by reordering the computation to minimize reads and writes to GPU memory, so the full score matrix never has to be stored. This delivered a 2-4x speedup and helped LLMs expand context from 4K tokens to 128K, and more recently to a million. Without FlashAttention, such long contexts would be far less practical. But FlashAttention-2 plateaued at about 35% utilization on the H100. Hopper-series GPUs introduced asynchronous Tensor Core instructions (WGMMA) and a hardware unit for asynchronous data transfer, the Tensor Memory Accelerator (TMA). FlashAttention-3 is built to use them.
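The reordering idea can be illustrated with a minimal pure-Python sketch. It is not the FlashAttention kernel: the real algorithm tiles both queries and keys/values and runs in on-chip GPU memory, while this version only shows the "online softmax" trick that lets keys and values be processed block by block. The function names and the `block_size` parameter are illustrative assumptions.

```python
import math

def score_matrix_bytes(seq_len, bytes_per_element=2):
    """Memory for one full seq_len x seq_len score matrix (FP16 = 2 bytes)."""
    return seq_len * seq_len * bytes_per_element

def naive_attention(Q, K, V):
    """Reference version: builds a full row of scores for every query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_size=2):
    """Visits K/V in blocks, keeping only a running max, sum and output."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * len(V[0])
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new maximum
            # (exp(-inf) is 0.0, so the first block starts from zero).
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Both functions return the same result up to floating-point rounding, but the tiled version never holds more than `block_size` scores at a time. For scale, `score_matrix_bytes(131072)` gives 32 GiB for a single attention head at a 128K-token context in FP16, which is why storing the full matrix is impractical.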
How the Acceleration Works
FlashAttention-3 applies three key improvements. First, asynchrony: Tensor Cores and the data-transfer hardware work simultaneously instead of waiting for each other. This overlap of computation and data movement is the primary source of the speedup. Second, interleaving of operations: instead of finishing a block's matrix multiplications and only then computing softmax, the algorithm overlaps the softmax of one block with the matrix multiplications of another, so the Tensor Cores stay busy while the slower softmax runs. Third, low-precision computation: FP8 uses eight bits per number instead of FP16's sixteen, which halves memory use and doubles the peak number of operations per second. With FP8, FlashAttention-3 reaches close to 1.2 PFLOPS (petaflops).
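A toy cost model, sketched below, shows why overlapping helps and why FP8 halves memory. The timings are hypothetical units rather than measurements, each block is reduced to one matmul step plus one softmax step (the real kernel has two matmuls per block), and all function names are assumptions made for illustration.

```python
def sequential_time(matmul_t, softmax_t, num_blocks):
    """Each block finishes its matmul, then its softmax, before the next starts."""
    return num_blocks * (matmul_t + softmax_t)

def pipelined_time(matmul_t, softmax_t, num_blocks):
    """Softmax of block i runs while the matmul of block i+1 runs."""
    if num_blocks == 0:
        return 0.0
    # First matmul and last softmax cannot be hidden; in between, each
    # step costs as much as the slower of the two overlapped operations.
    return matmul_t + (num_blocks - 1) * max(matmul_t, softmax_t) + softmax_t

def kv_bytes(seq_len, head_dim, bytes_per_element):
    """Storage for the K and V tensors of one attention head."""
    return 2 * seq_len * head_dim * bytes_per_element

# Hypothetical example: 8 blocks, both steps take 1 time unit.
print(sequential_time(1.0, 1.0, 8))   # 16.0
print(pipelined_time(1.0, 1.0, 8))    # 9.0
print(kv_bytes(4096, 128, 2), kv_bytes(4096, 128, 1))  # FP16 vs FP8 bytes
```

In this simplified model the pipelined schedule approaches a 2x speedup when the two steps take equal time, and the gain shrinks as one step dominates the other.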
Does Quality Suffer with FP8?
The main risk of low precision is that errors accumulate and degrade results. But the authors report that FlashAttention-3 with FP8 has 2.6x lower numerical error than baseline FP8 attention. It achieves this with block quantization and incoherent processing, which rotates queries and keys to spread out outlier values before quantizing. This matters for long contexts: when an LLM processes a million tokens, errors in one attention layer propagate through the 70+ layers of the model. FlashAttention-3 keeps these errors in check, which makes FP8 far more practical to use.
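One of the error-reduction techniques the FlashAttention-3 authors describe is incoherent processing: queries and keys are multiplied by the same random orthogonal matrix (a Hadamard transform with random signs) before quantization. The sketch below is a plain-Python illustration of that idea, not the authors' kernel, and it leaves out the FP8 quantization step itself. The function names and the example vectors are assumptions.

```python
import math
import random

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def make_rotation(d, seed=0):
    """Orthogonal matrix M = D * H / sqrt(d), with D a random +/-1 diagonal."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    H = hadamard(d)
    norm = 1.0 / math.sqrt(d)
    return [[signs[i] * H[i][j] * norm for j in range(d)] for i in range(d)]

def rotate(x, M):
    """Row vector times matrix: returns x @ M."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(x))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# A query with one large outlier, as can occur in LLM activations.
q = [100.0] + [0.1] * 7
k = [0.5] * 8
M = make_rotation(8)
q_rot, k_rot = rotate(q, M), rotate(k, M)

print(dot(q, k), dot(q_rot, k_rot))        # equal up to rounding
print(max(abs(x) for x in q), max(abs(x) for x in q_rot))  # outlier is spread out
```

Because M is orthogonal, the attention score q·k is unchanged, while the outlier's magnitude is spread across all coordinates. A narrower range of values is generally easier to represent accurately in a low-precision format.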
Who Benefits from This
Various scenarios benefit from FlashAttention-3:
- Training large models: the speedup can save months of compute time. For Meta and OpenAI, this means the ability to train more variants within the same budget. Where attention dominates the runtime, a 1.5-2x speedup translates into roughly 30-50% lower compute cost.
- Fast inference in production: one H100 can serve more users simultaneously, making commercial APIs more cost-effective.
- Long contexts: RAG systems, large document analysis, and archive search run faster, even at contexts approaching a million tokens.
- Mobile and edge computing: FP8 and reduced memory requirements point toward running models on weaker hardware, although FlashAttention-3 itself targets Hopper GPUs.
The code is published on GitHub, and integration into major ML libraries such as PyTorch is expected to follow. Developers are beginning to adopt it.
What This Means for the Industry
FlashAttention-3 arrives at the right time. Cloud GPUs are becoming more expensive, demand for LLMs is growing, and contexts are getting longer. Algorithms that push hardware to 75% of its peak throughput instead of 35% are not just useful; they are critical to the economics. They reduce the cost of training models, accelerate production deployment, and open up applications that were previously unprofitable. Companies that integrate FlashAttention-3 into their systems can expect meaningful savings on their compute bills. For researchers, this is also good news: they can experiment faster and try more architecture variants and model sizes.