FlashAttention-3 doubles transformer speed at 75% GPU utilization

Together AI introduced FlashAttention-3, a new algorithm that speeds up the attention computation in transformer-based large language models. It runs up to twice as fast as FlashAttention-2, and H100 GPU utilization now reaches as much as 75% of peak throughput, up from about 35% with FlashAttention-2. The algorithm also supports low-precision FP8 computation while preserving output accuracy. Together, these gains let LLMs process long text sequences more efficiently and at lower cost.
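
The two headline numbers are consistent with each other, which a quick calculation can show. The sketch below assumes both utilization figures are measured against the same peak throughput on the same GPU, so the speedup is simply the ratio of the two; the function name is hypothetical and used only for illustration.

```python
def speedup_from_utilization(old_util: float, new_util: float) -> float:
    """Throughput ratio implied by two utilization fractions.

    Assumes both fractions are measured against the same hardware peak,
    so the peak cancels out and only the ratio matters.
    """
    if not (0 < old_util <= 1 and 0 < new_util <= 1):
        raise ValueError("utilization must be a fraction in (0, 1]")
    return new_util / old_util


# Figures quoted in the article: 35% before, 75% after.
speedup = speedup_from_utilization(0.35, 0.75)
print(f"implied speedup: {speedup:.2f}x")  # implied speedup: 2.14x
```

A ratio of roughly 2.1 matches the reported speedup of about two times over FlashAttention-2.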