
FlashAttention received the inaugural Stanford Open Source Software Prize

May 22, 2024

By Together AI

FlashAttention, invented by Tri Dao (Chief Scientist of Together AI), Dan Fu (academic partner at Together AI), Stefano Ermon (advisor at Together AI), Atri Rudra, and Christopher Ré (founder of Together AI), was announced as a winner of the inaugural Stanford Open Source Software Prize at the CORES Symposium, selected from a pool of 75+ projects for its impact, engagement, and adoption across the industry.

FlashAttention is now widely used by companies and researchers to speed up Transformer training and inference. It has been integrated into PyTorch and many Hugging Face libraries to benefit the largest number of researchers and developers. The GitHub repo has received over 11k stars, with contributions from Meta and Mistral.
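
As a sketch of what that integration looks like from user code, the snippet below calls PyTorch's built-in scaled_dot_product_attention, which can dispatch to a FlashAttention-backed kernel. Whether that kernel is actually selected depends on the PyTorch version, GPU, dtype, and tensor shapes, so the shapes and settings here are purely illustrative.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, nheads, seqlen, headdim).
# Half precision on a CUDA device lets PyTorch choose its
# FlashAttention-backed kernel when one is available.
q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)

# Causal self-attention; the fused kernel never materializes the
# full 2048 x 2048 attention matrix in GPU memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```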

FlashAttention

FlashAttention is an algorithm that reorders the attention computation and leverages classical techniques (tiling, recomputation) to significantly speed it up and reduce memory usage from quadratic to linear in sequence length. Tiling means that we load blocks of inputs from HBM (GPU memory) into SRAM (fast on-chip cache), compute attention for that block, and update the output in HBM. By never writing the large intermediate attention matrices to HBM, we reduce the number of memory reads and writes, which yields a 2-4x wall-clock speedup.
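
To make the tiling idea concrete, here is a minimal single-head sketch using the online-softmax recurrence that the algorithm relies on. It only tiles over the key/value dimension and keeps the running statistics in ordinary tensors; the real kernel also tiles queries and keeps everything in SRAM, so the function name and block size below are illustrative, not the actual implementation.

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    # q, k, v: (seq_len, head_dim) for a single head.
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    # Running row-wise max and softmax normalizer (the per-block state
    # that FlashAttention keeps in SRAM instead of the full score matrix).
    row_max = torch.full((seq_len, 1), float("-inf"))
    row_sum = torch.zeros(seq_len, 1)

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]   # load one K/V block
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale        # (seq_len, block_size)

        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        # Rescale what was accumulated so far to the new max, then add
        # this block's contribution.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum
```

For small inputs this matches the reference `torch.softmax(q @ k.T * scale, dim=-1) @ v` up to floating-point error, while never holding more than one block of scores at a time.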

FlashAttention-2

FlashAttention-2 speeds up training and fine-tuning of LLMs by up to 4x and achieves 72% model FLOPs utilization for training on NVIDIA A100s. It builds on Tri and his co-authors' earlier work on FlashAttention, which is now used broadly across Transformer-based models.

Designed as a drop-in replacement for FlashAttention, FlashAttention-2 achieves a 2x speedup on the core attention operation and a 1.3x speedup when training Transformers end-to-end, even compared to previous implementations that were already highly optimized. Given that LLM training runs cost tens of millions of dollars, these improvements could save millions of dollars and enable models with twice the context length.
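
As a sketch of what "drop-in" looks like in practice, the call below uses flash_attn_func from the public flash-attn package. The import path, argument names, and tensor layout follow recent releases of that package and may differ between versions; the shapes are illustrative.

```python
import torch
from flash_attn import flash_attn_func  # pip install flash-attn

# Illustrative shapes: (batch, seqlen, nheads, headdim).
# The kernel expects fp16/bf16 tensors on a CUDA device.
q = torch.randn(2, 4096, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 4096, 16, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 4096, 16, 64, device="cuda", dtype=torch.float16)

# Causal attention over the 4k-token sequence without materializing
# the full 4096 x 4096 attention matrix in HBM.
out = flash_attn_func(q, k, v, causal=True)  # (2, 4096, 16, 64)
```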


