FlashConv: Speeding up state space models

State space models (SSMs) are a promising alternative to attention – they scale nearly linearly with sequence length instead of quadratically. However, SSMs often run slower than optimized implementations of attention out of the box, since they have low FLOP utilization on GPU. How can we make them run faster? In this blog post, we’ll go over FlashConv, our new technique for speeding up SSMs. We’ll see how, in language modeling, this helped us train SSM-based language models (with almost no attention!) up to 2.7B parameters – and run inference 2.4x faster than Transformers.
In our blog post on Hazy Research, we talk about some of the algorithmic innovations that helped us train billion-parameter language models with SSMs for the first time.

A Primer on State Space Models
State space models (SSMs) are a classic primitive from signal processing, and recent work from our colleagues at Stanford has shown that they are strong sequence models, with the ability to model long-range dependencies – they achieved state-of-the-art performance across benchmarks like LRA and on tasks like speech generation.
For the purposes of this blog post, there are a few important properties of SSMs to know:
- During training, they can be computed as a convolution whose kernel is as long as the input sequence
- They admit a recurrent formulation, which makes it possible to stop and restart the computation at any point in the convolution
The convolution dominates the computation time during training – so speeding it up is the key bottleneck. (The short sketch below illustrates the two views on a toy example.)
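To make these two properties concrete, here is a toy sketch, purely illustrative and not our actual code: a standard discrete SSM with state matrix $A$, input vector $B$, output vector $C$, and a scalar input sequence. Running the recurrence step by step and applying a causal convolution with kernel $k_i = C A^i B$ give the same output.

```python
import torch

# Toy SSM (hypothetical sizes): state dimension d, sequence length N.
torch.manual_seed(0)
d, N = 4, 16
A = 0.8 * torch.eye(d, dtype=torch.float64) + 0.02 * torch.randn(d, d, dtype=torch.float64)
B = torch.randn(d, dtype=torch.float64)
C = torch.randn(d, dtype=torch.float64)
u = torch.randn(N, dtype=torch.float64)

# Recurrent view: x_i = A x_{i-1} + B u_i,  y_i = C x_i
x, ys = torch.zeros(d, dtype=torch.float64), []
for u_i in u:
    x = A @ x + B * u_i
    ys.append(C @ x)
y_recurrent = torch.stack(ys)

# Convolutional view: y = k * u, with a kernel k_i = C A^i B as long as the sequence
k = torch.stack([C @ torch.linalg.matrix_power(A, i) @ B for i in range(N)])
y_conv = torch.stack([(k[:i + 1] * u[:i + 1].flip(0)).sum() for i in range(N)])

assert torch.allclose(y_recurrent, y_conv, atol=1e-10)
```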
FlashConv: Breaking the Bottleneck
So how do you efficiently compute a convolution that is as long as the input sequence (potentially thousands of tokens)?
FFT Convolution
The first step is using the convolution theorem. Naively, computing a convolution of length N over a sequence of length N takes $O(N^2)$ time. The convolution theorem says that we can compute it as a sequence of Fast Fourier Transforms (FFTs) instead. If you want to compute the convolution between a signal $u$ and a convolution kernel $k$, you can do it as follows:
$iFFT(FFT(u) \odot FFT(k)),$
where $\odot$ denotes pointwise multiplication. This takes the runtime from $O(N^2)$ to $O(N \log N)$.
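One detail worth spelling out: the formula above is a circular convolution, so in practice both signals are zero-padded to length $2N$ to recover the causal convolution we actually want. Here is a quick illustrative check with torch.fft (just the math, not our kernel) against a direct $O(N^2)$ convolution:

```python
import torch

N = 1024
u = torch.randn(N, dtype=torch.float64)   # input signal
k = torch.randn(N, dtype=torch.float64)   # convolution kernel, as long as the sequence

# Direct causal convolution: y[i] = sum_{j <= i} k[j] * u[i - j]  -- O(N^2)
y_direct = torch.stack([(k[:i + 1] * u[:i + 1].flip(0)).sum() for i in range(N)])

# Convolution theorem: zero-pad to 2N, then iFFT(FFT(u) * FFT(k))  -- O(N log N)
y_fft = torch.fft.irfft(
    torch.fft.rfft(u, n=2 * N) * torch.fft.rfft(k, n=2 * N), n=2 * N
)[:N]

assert torch.allclose(y_direct, y_fft, atol=1e-8)
```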
So we can just use torch.fft, and outperform attention:

Wait… the asymptotic performance looks good, but the FFT convolution is still slower than attention at sequence lengths <2K (which is where most models are trained). Can we make that part faster?
Fused FFT Convolution
Let’s look at what the PyTorch code actually looks like:
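(The snippet below is a representative sketch of that code rather than the exact implementation in our repo; the real layer has a few extra details, but the structure is the same.)

```python
import torch

def fftconv_pytorch(u, k):
    # Each of these operations launches its own kernel and materializes an
    # O(N)-sized intermediate in GPU HBM.
    N = u.shape[-1]
    u_f = torch.fft.rfft(u, n=2 * N)    # FFT of the input
    k_f = torch.fft.rfft(k, n=2 * N)    # FFT of the kernel
    y_f = u_f * k_f                     # pointwise multiply in frequency space
    y = torch.fft.irfft(y_f, n=2 * N)   # inverse FFT
    return y[..., :N]                   # keep the causal part
```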
Each of those lines requires reading and writing $O(N)$ data to GPU HBM – which makes the entire operation I/O-bound. But we know how to solve this! We can write a custom CUDA kernel that fuses those operations together (and soon, PyTorch 2.0 might do it automatically):

Progress! The crossover point is now 1K – good news for language models.
Block FFT Convolution
But can we do better? It turns out that now the operation is compute-bound.
Why? GPUs have fast specialized matrix multiplication units, such as tensor cores. Attention can take advantage of these, but standard FFT libraries cannot. Instead, they have to use the slower general-purpose hardware – which can be a significant gap in performance (on A100, tensor cores have 16x the FLOPs of general-purpose FP32 computations).
So we need some way to take advantage of the tensor cores on GPU. Luckily, there’s a classic algorithm called the Cooley-Tukey decomposition of the FFT, or six-step FFT algorithm. This decomposition lets us split the FFT into a series of small block-diagonal matrix multiplication operations, which can use the GPU tensor cores. There are more details in the paper, but this gives us more performance again!
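To show the structure, here is a small host-side sketch of the decomposition (illustrative only: a hypothetical helper name, full precision, and no fusion; the actual kernel runs these small matrix multiplies on-chip on tensor cores):

```python
import torch

def block_fft(x, N1, N2):
    # Cooley-Tukey decomposition: an FFT of length N = N1 * N2 becomes two batches
    # of small DFT matrix multiplies plus a pointwise twiddle-factor rescaling.
    N = N1 * N2
    n1 = torch.arange(N1, dtype=torch.float64)
    n2 = torch.arange(N2, dtype=torch.float64)
    F1 = torch.exp(-2j * torch.pi * torch.outer(n1, n1) / N1)      # (N1, N1) DFT matrix
    F2 = torch.exp(-2j * torch.pi * torch.outer(n2, n2) / N2)      # (N2, N2) DFT matrix
    twiddle = torch.exp(-2j * torch.pi * torch.outer(n1, n2) / N)  # couples the two stages

    x_mat = x.reshape(N1, N2).to(torch.complex128)
    step1 = F1 @ x_mat           # length-N1 DFTs down the columns (a matmul)
    step2 = step1 * twiddle      # twiddle-factor rescaling
    step3 = step2 @ F2           # length-N2 DFTs along the rows (another matmul)
    return step3.T.reshape(N)    # reorder into the standard FFT output order

# Sanity check against the library FFT
x = torch.randn(1024, dtype=torch.float64)
assert torch.allclose(block_fft(x, 32, 32), torch.fft.fft(x), atol=1e-8)
```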

More progress! Now the convolution is faster than FlashAttention for any sequence length greater than 512 – which is pretty good!
But what are those X marks?
Beyond SRAM: State-Passing
Let’s return to the kernel fusion step: fusing the convolution together without writing intermediate results to GPU HBM. This is only possible if all the intermediates can fit into GPU SRAM, which is very small (hundreds of KBs on A100). In our case, that means we can’t fuse the convolution once the sequence length exceeds 4K.
But here, the recurrent properties of SSMs save us again! SSMs admit a recurrent view, which lets us stop the convolution halfway through, save a state vector, and restart it. For our purposes, that means that we can split the convolution into chunks, and then sequentially use our block FFT on each chunk – running the state update at every point.
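Here is a toy, self-contained sketch of that idea (again a hypothetical scalar-input SSM, with a direct convolution standing in for the fused block-FFT convolution): run a chunk-sized convolution on each chunk, add the contribution of the state carried in from earlier chunks, and update the state for the next chunk.

```python
import torch

# Toy SSM with kernel k_i = C A^i B; sequence length N is a multiple of the chunk length L.
torch.manual_seed(0)
d, N, L = 4, 1024, 256
A = 0.8 * torch.eye(d, dtype=torch.float64) + 0.02 * torch.randn(d, d, dtype=torch.float64)
B = torch.randn(d, dtype=torch.float64)
C = torch.randn(d, dtype=torch.float64)
u = torch.randn(N, dtype=torch.float64)

Apow = [torch.linalg.matrix_power(A, i) for i in range(N)]
k = torch.stack([C @ Apow[i] @ B for i in range(N)])     # full-length SSM kernel

def causal_conv(k, u):
    # Direct causal convolution (stand-in for the fused FFT convolution)
    n = u.shape[0]
    return torch.stack([(k[:i + 1] * u[:i + 1].flip(0)).sum() for i in range(n)])

# Reference: one convolution over the whole sequence
y_full = causal_conv(k, u)

# State-passing: chunk-local convolutions plus a recurrent state update
x = torch.zeros(d, dtype=torch.float64)                  # SSM state carried across chunks
CA = torch.stack([C @ Apow[i + 1] for i in range(L)])    # C A^{i+1}: effect of the carried state
chunks = []
for s in range(0, N, L):
    u_c = u[s:s + L]
    y_c = causal_conv(k[:L], u_c) + CA @ x               # chunk convolution + past-state contribution
    x = Apow[L] @ x + sum(Apow[L - 1 - j] @ B * u_c[j] for j in range(L))  # state update
    chunks.append(y_c)

assert torch.allclose(torch.cat(chunks), y_full, atol=1e-8)
```

The key point is that each chunk only needs a convolution as long as the chunk itself, so every piece fits in SRAM and can use the fused block-FFT kernel, while the small state vector is all that has to travel between chunks.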
Putting it all together gives us FlashConv, which lets us speed up convolutions by up to 2x over the naive solutions, and outperforms FlashAttention by up to 35x at long sequence lengths.

Fast Training
We can use FlashConv to speed up model training. On the LRA benchmark, we see up to 5.8x speedup over Transformers:
- Transformer: 1x
- FlashAttention: 2.4x
- S4: 2.9x
- S4 + FlashConv: 5.8x
We used this speedup to replace attention with SSMs in language models and scaled our approach up to 2.7B parameters. Check out our blog post on Hazy Research for more details on H3, the new architecture we developed to get there!
Fast Inference
We compare the generation throughput of a hybrid H3 model and a Transformer model at the 1.3B-parameter scale. With batch size 64 and prompt lengths of 512, 1024, and 1536, hybrid H3 runs inference up to 2.4x faster than the Transformer.
What’s Next
We’re very excited about developing new systems innovations that allow new ideas in deep learning to flourish. FlashConv was critical to the development and testing of H3, a new language modeling approach that uses almost no attention layers.
We’re super excited by these results, so we’re releasing our code and models to the public: everything is available on GitHub. If you give it a try, we’d love to hear your feedback!