Together AI delivers the fastest inference for the top open-source models
Learn how we achieved up to 2x faster serverless inference for the most demanding LLMs, including GPT-OSS, Qwen, Kimi, and DeepSeek.
Together AI achieves the fastest inference for leading open-source models
Together AI now delivers up to 2x faster serverless inference for demanding open-source LLMs, ranking #1 in output speed benchmarks. These gains come from coordinated improvements across next-gen GPU hardware, optimized kernels, near-lossless quantization, and production-grade speculative decoding with custom-trained draft models. Key innovations include architecture-aware calibration, adaptive acceptance strategies, and scalable training pipelines supporting models of 1T+ parameters, delivering breakthrough speed for models like GPT-OSS, Qwen3, Kimi-K2, DeepSeek-R1, and DeepSeek-V3.1.
Over the past few months, our team has been laser-focused on one goal: making our inference platform the fastest place to run the world’s best open-source models. Today, we're excited to share results that speak for themselves.
Across multiple independent benchmarks from Artificial Analysis, our platform now consistently ranks #1 in output speed among GPU-based providers for the most demanding open-source models—including GPT-OSS-20B, GPT-OSS-120B, Qwen-3-235B-Instruct, Qwen-3-Coder-480B, Kimi-K2-Instruct, DeepSeek-R1, and DeepSeek-V3.1. On several models, we’re now delivering up to 2x better output speed than competing providers.
Performance at this level doesn’t come from a single change; it’s the result of coordinated improvements across hardware, kernels, runtime engine tuning, speculative decoding, and our draft-model training pipeline. In this post, we’ll share the story of how we achieved it.
1. Next-gen GPU hardware with engine optimization
A major portion of our performance gains comes from a fully modernized inference engine built to exploit the latest GPU hardware, optimized kernels, and emerging low-bit quantization formats such as FP4. Instead of optimizing isolated layers, we re-architected the entire stack—compute kernels, memory layout, execution graphs, and scheduling—to work together as a unified high-efficiency system.
Hardware-aware execution on the latest GPUs
Our engine is tuned specifically for the NVIDIA Blackwell architecture, including the NVIDIA GB200 NVL72. This includes optimized paths for low-precision compute (FP8, FP4), high-bandwidth data movement, and near-zero-overhead scheduling that maximizes utilization across all compute tiers. We don’t just run on fast hardware—we structure execution around it to extract its full capability in real workloads.
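One widely used ingredient of near-zero-overhead scheduling, in inference engines generally, is CUDA graph capture: recording a decode step once and replaying it as a single launch, eliminating per-kernel dispatch cost. The sketch below uses the public PyTorch API with a stand-in layer; it illustrates the general technique, not our engine internals.

```python
import torch

# A stand-in for one decode step; the real engine captures a full model step.
model = torch.nn.Linear(4096, 4096).cuda().half()
static_in = torch.zeros(1, 4096, device="cuda", dtype=torch.half)

# Warm up on a side stream (required before capture), then capture the step.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)   # recorded into the graph, not executed

# Replay: copy fresh data into the static buffer, then launch the whole
# captured step as a single graph launch instead of many kernel launches.
static_in.copy_(torch.randn_like(static_in))
g.replay()
print(static_out.norm())
```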
Together Kernels
We built and integrated a new generation of high-performance GPU kernels designed for the NVIDIA Blackwell architecture, enabling us to fully leverage its massive memory bandwidth. This includes our optimized FlashAttention-4 kernels, fused MoE kernels that combine routing and expert FFNs, and more. Together, these hardware-aware kernels dramatically improve throughput for large models and are a key driver of the performance uplift we now see across real-world workloads.
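To make the fusion concrete, here is a minimal PyTorch sketch of the unfused reference computation that a fused MoE kernel collapses into a single launch: route each token to its top-k experts, run the expert FFNs, and scatter the weighted results back. All names and shapes are illustrative assumptions, not our kernel code.

```python
import torch

# Unfused reference for a Mixture-of-Experts forward pass. A fused kernel
# performs routing, gather, expert FFNs, and scatter in one GPU launch.
def moe_forward(x, router_w, experts_w1, experts_w2, top_k=2):
    # x: [tokens, d_model], router_w: [d_model, n_experts]
    # experts_w1: [n_experts, d_model, d_ff], experts_w2: [n_experts, d_ff, d_model]
    scores = (x @ router_w).softmax(dim=-1)              # routing probabilities
    weights, idx = torch.topk(scores, top_k)             # top-k experts per token
    weights = weights / weights.sum(-1, keepdim=True)    # renormalize gate weights
    out = torch.zeros_like(x)
    for e in range(router_w.shape[1]):                   # expert loop (fused on GPU)
        rows, slots = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
        if rows.numel() == 0:
            continue
        h = torch.relu(x[rows] @ experts_w1[e])          # expert FFN, first projection
        out[rows] += weights[rows, slots, None] * (h @ experts_w2[e])
    return out

# Tiny smoke test with random weights.
x = torch.randn(8, 64)
y = moe_forward(x, torch.randn(64, 4),
                torch.randn(4, 64, 256), torch.randn(4, 256, 64))
print(y.shape)  # torch.Size([8, 64])
```

Executed this way, each stage launches separate kernels and round-trips activations through global memory; fusing them keeps intermediates on-chip, which is where much of the speedup comes from.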
2. Turbo optimization suite
Quantization
A key part of our speed gains comes from our ability to quantize large model weights to low-bit formats (FP8, FP4 as NVFP4 or MXFP4, and hybrid precision) while remaining effectively lossless in practice. Our pipeline performs architecture-aware calibration, fine-grained block-wise scaling, and selective mixed precision on sensitive paths, allowing us to retain target-model quality even at extreme compression levels. Combined with a runtime built for low-bit execution, including fused FP4/FP8 kernels, a quantized KV cache, and Blackwell-optimized memory layouts, we achieve major latency and throughput improvements without sacrificing accuracy. This near-lossless quantization capability is foundational to the up to 2x faster inference speeds we now deliver across the largest open-source models.
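As a rough illustration of what fine-grained block-wise scaling means, here is a sketch of the general technique in PyTorch (not our production pipeline): each small block of weights gets its own scale before casting to FP8, which preserves accuracy far better than a single per-tensor scale at low bit widths. FP4 formats apply the same idea at even smaller block sizes.

```python
import torch

# Block-wise FP8 quantization sketch (requires PyTorch >= 2.1 for float8).
# 448.0 is the max normal value of the FP8 E4M3 format.
def quantize_blockwise(w: torch.Tensor, block: int = 32, qmax: float = 448.0):
    """Quantize a [rows, cols] weight matrix with one scale per block."""
    rows, cols = w.shape
    assert cols % block == 0
    blocks = w.reshape(rows, cols // block, block)
    scale = blocks.abs().amax(dim=-1, keepdim=True) / qmax  # per-block scale
    scale = scale.clamp(min=1e-12)                          # avoid divide-by-zero
    q = (blocks / scale).to(torch.float8_e4m3fn)            # cast to FP8 (E4M3)
    return q, scale

def dequantize_blockwise(q, scale):
    return (q.to(torch.float32) * scale).reshape(q.shape[0], -1)

w = torch.randn(16, 128)
q, s = quantize_blockwise(w)
print((w - dequantize_blockwise(q, s)).abs().max())  # small reconstruction error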
Speculator algorithm
One of the biggest unlocks for our recent performance leap has been our work on production-grade speculative decoding algorithms.
Speculative decoding is not new, but making it reliable across data domains and consistently faster in a multi-tenant serverless environment is extremely difficult. Our implementation includes:
- Training-efficient algorithms that extract more speedup per training FLOP.
- High-accuracy draft models trained and optimized specifically for each target model.
- Adaptive acceptance strategies that maximize speed while preserving output quality.
- Fail-safe fallback mechanisms that keep latency predictable under load.
This unlocks substantial gains, especially for models like Kimi or Qwen3, where our SpecDec stack nearly doubles output speed. See our ATLAS blog post and our announcement of the world’s fastest inference on Blackwell for more details.
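For readers new to the technique, here is a minimal sketch of the draft-and-verify loop at the core of speculative decoding, following the standard formulation (Leviathan et al., 2023) rather than our production implementation; `draft` and `target` are assumed callables that map a token prefix to a next-token distribution.

```python
import torch

@torch.no_grad()
def speculative_step(draft, target, prefix, gamma=4):
    """One draft-and-verify step over a 1-D token prefix."""
    # 1. The small draft model proposes `gamma` tokens autoregressively.
    proposed, q_of = [], []
    ctx = prefix
    for _ in range(gamma):
        q = draft(ctx)                 # [vocab_size] probabilities
        tok = torch.multinomial(q, 1)  # sample a draft token
        proposed.append(tok)
        q_of.append(q[tok])
        ctx = torch.cat([ctx, tok])
    # 2. The target model verifies the proposals. In production this is ONE
    #    batched forward pass over all gamma positions; shown per-token here
    #    for clarity only.
    accepted = []
    for i, tok in enumerate(proposed):
        p = target(torch.cat([prefix, *accepted]))
        # 3. Accept with probability min(1, p(tok)/q(tok)); this keeps the
        #    output distribution identical to sampling from the target alone.
        if torch.rand(()) < (p[tok] / q_of[i]).clamp(max=1.0):
            accepted.append(tok)
        else:
            # The full algorithm resamples a replacement token from the
            # residual distribution max(p - q, 0); omitted for brevity.
            break
    return torch.cat([prefix, *accepted]) if accepted else prefix
```

Every accepted token is one the target model never had to generate serially, so a draft model with a high acceptance rate directly multiplies output speed.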
Large-scale speculator training
To support the largest modern LLMs, we built a fully scalable draft-model training pipeline. This is the backbone that lets us deploy high-quality speculative decoders for models that do not ship with off-the-shelf speculators, and improve upon those that do.
We developed:
- A scalable training framework that supports high-performance speculator algorithms for target models of 1T parameters and beyond.
- Curriculum-based training, post-training recipes, and data-mixing strategies that teach draft models to match the target model’s stylistic and structural outputs.
- An alignment evaluation framework to test and iterate on draft-model quality with respect to target models.
- High-performance architectures and pre-trained base models that can be adapted as speculators for many target models.
The result: draft models with high acceptance rates, delivering the world’s fastest inference on the NVIDIA Blackwell architecture.
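Acceptance rate is the quantity that ties this together. Under the standard speculative-decoding analysis (general theory, not Together-specific numbers), the expected number of tokens produced per target-model forward pass grows quickly with the per-token acceptance rate and the draft length:

```latex
% Expected tokens per target forward pass (Leviathan et al., 2023), where
% \alpha is the per-token acceptance rate and \gamma the draft length.
\[
  \mathbb{E}[\text{tokens per step}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
\]
% Example: \alpha = 0.8, \gamma = 4 gives (1 - 0.8^5)/(1 - 0.8) \approx 3.4
% tokens per pass, i.e. roughly 3x fewer serial target-model passes.
```

This is why the alignment evaluation framework above centers on acceptance: small gains in draft-target agreement compound into large end-to-end speedups.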
What’s Next?
We’re committed to making open-source AI models not only accessible, but also leaders in performance and planet-scale deployment. These latest benchmarks are a milestone, but not the finish line.
We’re already working on:
- Even faster generation for downstream domains.
- New generation strategies beyond speculative decoding.
- Expanded support for hybrid quantization.
And of course, continuing to push inference performance forward.
Contact us
If you are interested in exploring NVIDIA GB200 NVL72 or other Blackwell GPUs for your workloads, or would like to learn more about how our world-class inference optimization works, we invite you to get in touch with our Customer Experience team.