SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

June 18, 2024

By 

Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin

We introduce SpecExec, a new speculative decoding method that applies the classical approach of “speculative execution” to LLM inference. Using SpecExec, we attain inference speeds for 70B parameter LLMs on consumer GPUs with RAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens per second with 16-bit weights. These speeds correspond to speedups over autoregressive decoding of up to 10.6x and 18.7x, respectively.

Background

As large language models (LLMs) like LLaMA and Mistral gain widespread adoption, AI enthusiasts and practitioners are looking for ways to run them faster and on less expensive consumer hardware. Given the limited memory available on consumer GPUs (e.g., 24 GB on RTX 4090), many large models cannot even fit on such devices, thus necessitating offloading model parameters for inference. In offloading, the model is stored in RAM and layers are loaded onto the GPU sequentially during a forward pass. Naturally, this is quite slow, since transferring a 70B parameter model from RAM to GPU in 16-bit precision can take over 5 seconds even with a PCIe gen 4 bus.
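The 5-second figure follows from simple arithmetic. A minimal sketch, assuming a sustained PCIe gen 4 x16 bandwidth of roughly 25 GB/s (the theoretical peak is near 32 GB/s; real systems vary):

```python
# Back-of-the-envelope cost of offloading a 70B-parameter model over PCIe.
# The 25 GB/s bandwidth figure is an assumption; exact numbers vary by system.

def transfer_seconds(n_params: float, bytes_per_param: float, bandwidth_gbps: float) -> float:
    """Time to move all weights from RAM to GPU at a given bandwidth (GB/s)."""
    total_gb = n_params * bytes_per_param / 1e9
    return total_gb / bandwidth_gbps

# 70B parameters in 16-bit precision is 140 GB of weights: over 5 s per pass.
t_fp16 = transfer_seconds(70e9, 2, 25.0)
# 4-bit quantization cuts the transfer (and hence the pass time) by roughly 4x.
t_int4 = transfer_seconds(70e9, 0.5, 25.0)
print(f"fp16: {t_fp16:.1f} s, int4: {t_int4:.1f} s")
```

This is why every trick that lets one offloaded forward pass do more work pays off so heavily in this regime.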

To accelerate LLM inference, one can use speculative decoding. This approach typically involves using a much smaller “draft” model to quickly generate proposed continuation tokens for the input sequence. The main “target” model can then verify these proposed tokens in parallel and choose which (if any) tokens to accept, using a stochastic sampling algorithm. Given that LLM decoding is extremely memory-bound (especially when offloading), the target model can verify many (several thousand when offloading!) speculated tokens in the same amount of time as it would take to generate a single token. Therefore, as long as the average number of generated tokens per iteration can compensate for the overhead of running the draft model, this approach speeds up inference.
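The propose-then-verify loop can be sketched as follows. This is a simplified illustration, not any particular system's implementation: `draft_next` and `target_accepts` are hypothetical stand-ins for the draft model's sampler and the target model's (batched) verification step.

```python
# Minimal sketch of draft-propose / target-verify speculative decoding.
# `draft_next` and `target_accepts` are hypothetical stand-ins for model calls.

def speculative_step(prefix, draft_next, target_accepts, k=4):
    """Propose k draft tokens, then keep the longest verified prefix of them."""
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)  # cheap draft model proposes one token at a time
        proposed.append(tok)
        ctx.append(tok)
    accepted = []
    for tok in proposed:  # in a real system, all proposals are checked in a
        if target_accepts(prefix + accepted, tok):  # single target forward pass
            accepted.append(tok)
        else:
            break  # first rejection ends the accepted prefix
    return accepted
```

Because verification of all `k` proposals costs about the same as generating one token in the memory-bound regime, every accepted token beyond the first is nearly free.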

The SpecExec method

Our method named SpecExec (after Speculative Execution) was designed to maximize the speedups from speculative decoding with offloading. Unlike most speculative decoding methods, SpecExec does not use a stochastic verification algorithm to decide which tokens to accept by comparing their respective probabilities from the draft and target models. Instead, SpecExec directly applies speculative execution to LLM inference. It uses a powerful draft model to deterministically construct a large draft tree containing the most likely continuations of the input text (according to the draft model). It then precomputes and caches the target model’s next-token distributions for each node in the tree (representing a partial continuation of the text) using a single forward pass of the target model. Lastly, it sequentially samples from these distributions, starting at the root of the tree, until a token is sampled that falls outside of the tree ("cache miss"); at this point, the process repeats, and a new draft tree is constructed.
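The sampling phase described above can be sketched in a few lines. This is a toy illustration under the assumption that the draft tree and the target model's cached next-token distributions have already been computed; the names (`tree`, `cached_dists`, `sample_from`) are illustrative, not the paper's implementation.

```python
# Toy sketch of SpecExec's sampling phase over a precomputed draft tree.
import random

def sample_from(dist, rng):
    """Sample a token id from a {token: prob} distribution."""
    r, acc = rng.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # guard against floating-point rounding

def accept_tokens(tree, cached_dists, root, rng):
    """Walk the draft tree, sampling from the target model's cached
    distributions until the sampled token leaves the tree (a 'cache miss')."""
    out, node = [], root
    while True:
        tok = sample_from(cached_dists[node], rng)  # exact target distribution,
        out.append(tok)                             # so outputs match AR decoding
        if tok not in tree.get(node, {}):
            return out            # cache miss: stop and build a new draft tree
        node = tree[node][tok]    # cache hit: descend one level and continue
```

Note that even the final "cache miss" token is a valid sample from the target model, so every iteration makes progress.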

The correctness of this algorithm is easy to see: because we are always sampling from the target model’s next-token distributions, it is clear that we maintain the same output distribution as regular autoregressive decoding from the target model.

This method works especially well because of the “spikiness” of token probability distributions in modern LLMs. As shown in the figure below (left), the top-1 token of the Llama-2 70B next-token distribution holds close to 90% of its probability mass on average. Furthermore, a strong draft model is often able to predict these very likely tokens from the target model. For example, the top-4 tokens according to Llama-2 7B on average cover almost 90% of the Llama-2 70B probability distribution. This means that even with a small “cache” generated from the most likely draft model tokens, a sample from the target model is very likely to get a “cache hit”.

Average cumulative probability (according to Llama-2 70B) of the top-k tokens according to different draft models, on the Open Assistant dataset.
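The coverage statistic plotted above can be computed as follows. The two toy distributions here are illustrative stand-ins for real model outputs:

```python
# How one might measure 'spikiness' coverage: the cumulative target-model
# probability mass covered by the draft model's top-k tokens.

def topk_coverage(draft_probs, target_probs, k):
    """Target-distribution mass covered by the draft model's top-k tokens."""
    top_k = sorted(draft_probs, key=draft_probs.get, reverse=True)[:k]
    return sum(target_probs.get(tok, 0.0) for tok in top_k)

draft = {"the": 0.5, "a": 0.3, "cat": 0.1, "dog": 0.1}
target = {"the": 0.7, "a": 0.2, "dog": 0.05, "sat": 0.05}
coverage = topk_coverage(draft, target, 2)  # close to 0.9 for this toy pair
```

High coverage at small `k` is exactly what makes a compact draft tree likely to contain the target model's sample.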

We note that this method performs best with a very capable draft model (e.g., Llama2-7B for Llama2-70B target), but this is perfectly suited for the offloading regime. While small draft models are often preferable for on-chip speculative decoding due to their speed, in the offloading setting we can afford much larger draft models: as long as the draft model fits on the GPU, a single forward pass of the draft model will still be much faster than a forward pass from the offloaded target model!

SpecExec Performance

In the figure above, we compare SpecExec’s performance with the popular speculative decoding method named SpecInfer. Specifically, we compare the average number of generated tokens per iteration for different speculative budget sizes, as we want to understand which method scales better to larger budgets. Both methods perform similarly at low token budgets, but as the budget grows, SpecInfer’s performance plateaus while SpecExec continues improving to over 20 generated tokens per step with budgets beyond 1K. The chart above is based on the MT-Bench dataset and Llama 2-7B/70B chat models (temperature=0.6, top-p=0.9).

The table below compares the end-to-end speedups from SpecExec vs SpecInfer with offloading on an A100 GPU. While SpecInfer shows impressive speedups, SpecExec more than doubles its performance both in speed and in accepted token counts, attaining speedups of 16.4x-18.7x relative to autoregressive decoding with offloading.

Inference speed with RAM offloading, A100 GPU, Chat / Instruct models, using SpecExec (SX) and SpecInfer (SI) methods

| Draft / Target models | Temp. (t) | Method | Budget | Gen. rate | Speed, tok/s | Speedup |
|---|---|---|---|---|---|---|
| Llama2-7b / 70b | 0.6 | SX | 2048 | 20.60 | 3.12 | 18.7x |
| Llama2-7b / 70b | 0.6 | SI | 1024 | 8.41 | 1.34 | 8.0x |
| Llama2-7b / 70b | 0 | SX | 1024 | 18.8 | 2.74 | 16.4x |
| Llama2-7b / 70b | 0 | SI | 1024 | 7.86 | 1.18 | 7.1x |
| Llama2-7b / 70b GPTQ | 0.6 | SX | 128 | 12.10 | 6.02 | 8.9x |
| Llama2-7b / 70b GPTQ | 0 | SX | 256 | 13.43 | 6.17 | 9.1x |
| Mistral-7b / Mixtral-8x7b-GPTQ | 0.6 | SX | 256 | 12.38 | 3.58 | 3.5x |
| Llama3-8b / 70b | 0.6 | SX | 1024 | 18.88 | 2.62 | 15.6x |
| Llama3-8b / 70b | 0.6 | SX | 1024 | 18.16 | 2.79 | 16.6x |
| Llama3-8b / 70b | 0 | SX | 2048 | 21.58 | 2.94 | 17.5x |

SpecExec can speed up LLM inference on various types of hardware. In addition to the researcher-grade A100, we evaluated SpecExec on consumer GPUs ranging from the RTX 2080 Ti to the RTX 4090. The results below were achieved with a quantized model (4-bit Llama-2 70B) that fits in the RAM of most consumer-grade computers. Note that the speedup ranges from 4.6x to 10.6x, allowing generation speeds in the 3-6 tokens/s range.

SpecExec inference on consumer GPUs with offloading, chat/instruct models, Llama-2-70B-GPTQ target model, t=0.6, OpenAssistant dataset

| GPU | Draft model | Budget | Gen. rate | Speed, tok/s | Speedup |
|---|---|---|---|---|---|
| RTX 4090 | Llama2-7b GPTQ | 256 | 13.46 | 5.66 | 8.3x |
| RTX 4060 | Llama2-7b GPTQ | 128 | 9.70 | 3.28 | 4.6x |
| RTX 3090 | Llama2-7b GPTQ | 256 | 14.3 | 3.68 | 10.6x |
| RTX 2080Ti | ShearedLlama-1.3B | 128 | 7.34 | 1.86 | 6.1x |

Comparison to Sequoia

We recently released Sequoia, another speculative decoding method similarly aimed at speeding up LLM inference by generating optimized tree structures. Although SpecExec and Sequoia have similarities at a high level, they have different strengths and target different use cases. In particular, Sequoia is more amenable to on-chip inference due to its use of static trees, as they are easier to optimize with methods like CUDA Graphs and torch.compile. Furthermore, Sequoia can attain high acceptance rates across both high and low temperatures, whereas SpecExec shines at lower temperatures where the target model distribution is more “spiky”. SpecExec, on the other hand, targets the offloading regime, where the relative overhead of dynamic search is much smaller, and where using large and powerful draft models is practical. In this regime, it can reach faster speeds than Sequoia by dynamically constructing trees tailored to each input. As an example, when using a Llama-3 8B draft model for a Llama-3 70B target model (with offloading), Sequoia attains 2.2 tokens per second on average on the MT-Bench dataset, whereas SpecExec attains 2.8 tokens per second — a 27% improvement.

Conclusion

SpecExec represents a significant advancement in running large language models on consumer hardware. By leveraging the spikiness in token probability distributions and a capable draft model, it achieves impressive inference speedups, thus helping make LLMs more accessible and usable by a broader audience.

At Together, we are obsessed with making LLM inference as efficient as possible (among other things), through a combination of algorithmic and systems research. If you share our passion, and found SpecExec interesting, please contact us, or apply to open roles here!


