
FlashAttention received the inaugural Stanford Open Source Software Prize

May 22, 2024

By Together AI

FlashAttention, invented by Tri Dao (Chief Scientist of Together AI), Dan Fu (academic partner at Together AI), Stefano Ermon (advisor at Together AI), Atri Rudra, and Christopher Ré (founder of Together AI), was announced as a winner of the inaugural Stanford Open Source Software Prize at the CORES Symposium, selected from a pool of 75+ projects for its impact, engagement, and adoption across the industry.

FlashAttention is now widely used by companies and researchers to speed up Transformer training and inference. It has been integrated into PyTorch and many Hugging Face libraries to benefit the largest number of researchers and developers. The GitHub repo has received over 11k stars, with contributions from Meta and Mistral.
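
As a sketch of what that integration looks like from user code, the snippet below calls PyTorch's built-in scaled_dot_product_attention, which can dispatch to a FlashAttention-backed kernel. Whether that kernel is actually selected depends on the PyTorch version, GPU, dtype, and tensor shapes, so the shapes and settings here are purely illustrative.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, nheads, seqlen, headdim).
# Half precision on a CUDA device lets PyTorch choose its
# FlashAttention-backed kernel when one is available.
q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)

# Causal self-attention; the fused kernel never materializes the
# full 2048 x 2048 attention matrix in GPU memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```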

FlashAttention

FlashAttention is an algorithm that reorders the attention computation and leverages classical techniques (tiling, recomputation) to significantly speed it up and reduce memory usage from quadratic to linear in sequence length. Tiling means that we load blocks of inputs from HBM (GPU memory) into SRAM (fast on-chip cache), compute attention for that block, and update the output in HBM. By never writing the large intermediate attention matrices to HBM, we reduce the number of memory reads and writes, which yields a 2-4x wall-clock speedup.
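
To make the tiling idea concrete, here is a minimal single-head sketch using the online-softmax recurrence that the algorithm relies on. It only tiles over the key/value dimension and keeps the running statistics in ordinary tensors; the real kernel also tiles queries and keeps everything in SRAM, so the function name and block size below are illustrative, not the actual implementation.

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    # q, k, v: (seq_len, head_dim) for a single head.
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    # Running row-wise max and softmax normalizer (the per-block state
    # that FlashAttention keeps in SRAM instead of the full score matrix).
    row_max = torch.full((seq_len, 1), float("-inf"))
    row_sum = torch.zeros(seq_len, 1)

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]   # load one K/V block
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale        # (seq_len, block_size)

        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        # Rescale what was accumulated so far to the new max, then add
        # this block's contribution.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum
```

For small inputs this matches the reference `torch.softmax(q @ k.T * scale, dim=-1) @ v` up to floating-point error, while never holding more than one block of scores at a time.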

FlashAttention-2

FlashAttention-2 speeds up training and fine-tuning of LLMs by up to 4x and achieves 72% model FLOPs utilization for training on NVIDIA A100s. It builds on Tri and his co-authors' earlier work on FlashAttention, which is now used broadly across Transformer-based models.

Designed as a drop-in replacement for FlashAttention, FlashAttention-2 achieves a 2x speedup on the core attention operation and a 1.3x speedup when training Transformers end-to-end, even compared to previous implementations that were already highly optimized. Given that LLM training runs cost tens of millions of dollars, these improvements could save millions of dollars and enable models with twice the context length.
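
As a sketch of what "drop-in" looks like in practice, the call below uses flash_attn_func from the public flash-attn package. The import path, argument names, and tensor layout follow recent releases of that package and may differ between versions; the shapes are illustrative.

```python
import torch
from flash_attn import flash_attn_func  # pip install flash-attn

# Illustrative shapes: (batch, seqlen, nheads, headdim).
# The kernel expects fp16/bf16 tensors on a CUDA device.
q = torch.randn(2, 4096, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 4096, 16, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 4096, 16, 64, device="cuda", dtype=torch.float16)

# Causal attention over the 4k-token sequence without materializing
# the full 4096 x 4096 attention matrix in HBM.
out = flash_attn_func(q, k, v, causal=True)  # (2, 4096, 16, 64)
```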


