Optimizing inference speed and costs: Lessons learned from large-scale deployments
How can teams reduce inference latency without massive costs?
Achieving faster inference doesn't always mean paying more for a bigger cluster. At Together AI, we've seen that the teams that consistently deliver both low latency and low cost share a few key habits:
- They maximize the usable work extracted from every GPU
- They actively eliminate invisible compute stalls
- They strategically select decoding techniques based on their specific traffic patterns
- They view performance tuning as an ongoing discipline, not a one-time configuration task
By excelling in these areas, your cluster can provide faster responses while simultaneously reducing the cost per token.
Why inference cost efficiency matters
AI products are getting more competitive by the week — and user expectations are rising just as fast.
For leading AI-native companies — like Cursor, which needs massive throughput without compromising speed, and Decagon, which needs real-time responses despite unpredictable traffic patterns — the pressure is the same everywhere:
- Be fast. Sub-500ms TTFT and fast decoding speed
- Be predictable. No surprise tail latencies
- Be affordable. GPU bills can’t scale linearly with traffic
- Be ready for spikes. Because traffic never behaves the way you expect
Across customers, we consistently see the same imperative: deliver sub-second responses, without doubling the GPU bill. The good news? You don’t need exotic architectures or hundreds of extra GPUs to maintain inference cost efficiency.
Most teams get meaningful wins by optimizing how their inference runs, not purely how much hardware they buy.
How inference optimization works
Here are the levers that reliably move both speed and cost in the right direction.
1. Start at the model level: quantization and distillation
Quantization
Dropping precision (FP16 → FP8 → FP4) makes the model lighter on memory and faster to run, with virtually no quality loss when done well, as we do at Together AI.
This unlocks:
- Noticeably faster tokens/sec
- Bigger batch sizes at the same GPU footprint
- Lower cost per token
- Smoother scaling for real-time workloads
We’ve seen in many production deployments that FP8 or FP4 quantization delivers 20–40% throughput improvement, without harming output quality.
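For intuition on why lower precision pays off, here is a minimal back-of-the-envelope sketch of how weight memory shrinks as precision drops, freeing headroom for KV cache and larger batches. The parameter count and GPU memory figure are illustrative assumptions, not a description of any specific deployment:

```python
# Rough memory estimate for model weights at different precisions.
# The 8B parameter count and 80 GB accelerator are illustrative assumptions.

PARAMS = 8e9            # hypothetical 8B-parameter model
GPU_MEM_GB = 80         # e.g. a single 80 GB accelerator

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    weight_gb = PARAMS * nbytes / 1e9
    headroom_gb = GPU_MEM_GB - weight_gb
    print(f"{precision}: weights ~{weight_gb:.0f} GB, "
          f"headroom for KV cache and batching ~{headroom_gb:.0f} GB")
```

The extra headroom is what lets you run bigger batches on the same GPU footprint, which is where much of the throughput gain comes from.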
Distillation
Not every workload needs the full weight of a frontier model. Distillation trains a smaller model to mimic a larger one, preserving reasoning patterns while cutting compute cost dramatically.
DeepSeek-R1 is a great example. Its distilled variants are fast, lightweight, and still excellent at reasoning — making them perfect for:
- Interactive chat
- Coding assistants
- Routing and classification
- High-volume enterprise workloads
- Inference at the edge or under tight latency budgets
You can see how teams deploy R1 and its distilled variants securely on Together AI in this post.
Distilled R1 variants deliver a quality-to-latency ratio that’s extremely compelling for production workloads — often enabling 2–5× lower cost at similar quality bands for many tasks.
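For intuition, here is a minimal PyTorch-style sketch of the core distillation step: the student is trained to match the teacher's softened token distribution alongside the usual next-token loss. The model handles, temperature, and loss weighting are illustrative placeholders, not our production recipe:

```python
import torch
import torch.nn.functional as F

# Minimal distillation step: pull the student's token distribution toward the
# teacher's softened distribution, blended with standard cross-entropy.
# `student`, `teacher`, and the hyperparameters are placeholders.

def distillation_step(student, teacher, input_ids, labels,
                      temperature=2.0, alpha=0.5):
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits

    # KL divergence between softened teacher and student distributions.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Standard next-token cross-entropy against the ground-truth labels.
    ce_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * kd_loss + (1 - alpha) * ce_loss
```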
Together, quantization and distillation offer some of the largest cost reductions available before touching hardware or cluster architecture.
2. Reduce network latency at the edge (regional inference proxies)
Sometimes the biggest latency win isn’t compute, but geography. Even with extremely fast models, network distance is often the slowest part of the request path.
Dropping a lightweight proxy in the same region as your inference cluster cuts out long round-trip paths before generation even starts.
This alone can shave 50–100 ms off TTFT, and make tail latency far more predictable.
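A quick way to check whether geography is eating your latency budget is to compare round-trip times from your client region to different entry points. A minimal sketch, with placeholder URLs you would swap for your own endpoints:

```python
import time
import requests  # assumes the `requests` package is installed

# Placeholder URLs: swap in your current endpoint and a same-region proxy.
ENDPOINTS = {
    "direct (cross-region)": "https://inference.far-region.example.com/health",
    "regional proxy":        "https://proxy.same-region.example.com/health",
}

for name, url in ENDPOINTS.items():
    samples = []
    for _ in range(5):
        start = time.perf_counter()
        requests.get(url, timeout=5)
        samples.append((time.perf_counter() - start) * 1000)
    print(f"{name}: median RTT ~{sorted(samples)[len(samples) // 2]:.0f} ms")
```

If the proxy path is consistently tens of milliseconds faster, that difference flows directly into TTFT.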
3. Reduce unnecessary compute (memory stalls, KV inefficiencies, fragmentation)
Most models aren't slow; the pipelines around them are, so your GPU spends a lot of time doing nothing but waiting. The biggest culprits tend to be:
- Kernels that don’t work together efficiently, forcing the GPU to pause between prefill, attention, and decoding
- MoE layers that spend more time waiting on memory than doing useful work, especially when expert routing is unbalanced
- Prefill paths that struggle with long prompts, leading to slow starts and uneven performance
- Batching or scheduling gaps that leave portions of the GPU idle while work is still available
At Together AI, we've run benchmarks across the Llama, Qwen, Mistral, and DeepSeek families (highlighted in our fastest inference for the top open-source models blog) showing that kernel fusion, smarter MoE execution, streamlined tokenization, and better scheduling eliminate much of this wasted time, unlocking faster responses and higher throughput.
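KV-cache pressure is one of the easiest of these issues to reason about with simple arithmetic: estimate how much memory each concurrent request's KV cache consumes and how many requests fit alongside the weights. A rough sketch, with all model-shape numbers as illustrative assumptions rather than any specific model's config:

```python
# Rough KV-cache sizing for a hypothetical dense transformer.

layers = 32
kv_heads = 8             # grouped-query attention
head_dim = 128
context_len = 8192
bytes_per_value = 2      # FP16 KV cache; 1 for FP8

# K and V per token, summed across all layers.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
kv_gb_per_request = kv_bytes_per_token * context_len / 1e9

weights_gb = 16          # e.g. an 8B model in FP16
gpu_mem_gb = 80
concurrent_requests = int((gpu_mem_gb - weights_gb) / kv_gb_per_request)

print(f"KV cache per full-context request: ~{kv_gb_per_request:.2f} GB")
print(f"Max concurrent full-context requests on one GPU: ~{concurrent_requests}")
```

When that number is smaller than your real concurrency, requests queue, batches shrink, and the GPU stalls even though it looks "busy".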
4. Use the right decoding optimization (MTP, speculative decoding, draft models)
Decoding is where a lot of time gets lost — and also where some of the easiest wins live.
- MTP (multi-token prediction): Predicts multiple tokens per step, increasing decode speed and GPU efficiency
- Speculative decoding: Uses a small “draft” model to accelerate generation for predictable workloads
- Traditional speculative decoding uses a fixed drafting strategy, but modern engines allow teams to optimize for their specific traffic distribution — maximizing speed while minimizing quality regressions. We did this with our own speculator, ATLAS.
- We break down these strategies in detail in our customized speculative decoding post.
When tuned properly, these techniques often deliver 20–50% faster decoding and significantly higher throughput per GPU.
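The accept/reject logic at the heart of speculative decoding is easy to sketch. Below is a toy, greedy-verification version: a cheap draft proposes k tokens and the target keeps everything up to the first mismatch. In production the target verifies all k proposals in a single batched forward pass and compares full probability distributions rather than argmax; `draft_next` and `target_next` are placeholder functions:

```python
# Toy speculative decoding loop with greedy verification.
# `draft_next(seq)` and `target_next(seq)` are placeholders for a small draft
# model and the full target model, each returning the next token id.

def speculative_decode(prompt, draft_next, target_next, k=4, max_new_tokens=64):
    seq = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # 1. The draft model cheaply proposes k candidate tokens.
        draft_seq = list(seq)
        proposal = []
        for _ in range(k):
            tok = draft_next(draft_seq)
            proposal.append(tok)
            draft_seq.append(tok)

        # 2. The target verifies; in production this is one batched forward
        #    pass, here we only illustrate the accept/reject logic.
        for tok in proposal:
            expected = target_next(seq)
            generated += 1
            if expected != tok:
                seq.append(expected)   # first mismatch: keep the target's token
                break
            seq.append(tok)            # match: the drafted token is accepted
    return seq
```

The win comes from the fact that verifying k drafted tokens costs roughly one target-model step, so every accepted token beyond the first is nearly free.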
5. Pick the right hardware for your workload (and use the right parallelism)
With new hardware generations arriving every year or so, hardware choice increasingly shapes both cost and latency.
- Blackwell GPUs offer major improvements in per-token throughput and attention kernel speed.
- NVIDIA Grace Blackwell (GB200) systems tightly pair CPU + GPU memory, reducing data movement overhead and boosting throughput for large batch sizes and long contexts.
But to fully benefit from this hardware, large models need to be split and scheduled intelligently across devices. That’s where parallelism strategies come in:
- Tensor parallelism splits individual layers across GPUs, letting very large models run efficiently without becoming memory-bound.
- Expert parallelism distributes different experts in MoE models across GPUs, so each GPU specializes in a subset of experts instead of doing everything.
Teams running billions of tokens per day typically see a clear drop in cost-per-token when moving heavy workloads to NVIDIA Blackwell-class hardware with the right parallelism strategy.
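To make tensor parallelism concrete, here is a tiny NumPy sketch of column-parallel matrix multiplication: the weight matrix is split across "GPUs" (simulated here as array shards), each computes a partial output, and concatenation stands in for the all-gather collective a real implementation would run over NCCL. This is illustrative only:

```python
import numpy as np

# Single-process illustration of column-parallel (tensor-parallel) matmul.
# Each "GPU" holds a vertical slice of the weight matrix and computes its own
# slice of the output; concatenation plays the role of an all-gather.

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))         # activations: (batch, hidden)
w = rng.standard_normal((512, 2048))      # full weight: (hidden, ffn_dim)

num_gpus = 4
w_shards = np.split(w, num_gpus, axis=1)              # each shard: (512, 512)
partial_outputs = [x @ shard for shard in w_shards]   # computed "per GPU"
y_parallel = np.concatenate(partial_outputs, axis=1)  # "all-gather"

assert np.allclose(y_parallel, x @ w)     # matches the unsharded computation
print("Tensor-parallel result matches the single-GPU result.")
```

Expert parallelism follows the same idea, except the shards are whole experts and the routing layer decides which shard each token visits.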
6. Dynamically shift GPU capacity across endpoints
Traffic is rarely distributed evenly across all services, which is where dynamic capacity shifting comes in:
- Feature: Dynamic scaling between endpoints based on real-time concurrency and demand
- How it works: GPUs are automatically reassigned to the busiest endpoints, while idle endpoints relinquish capacity
- Outcome: Higher total utilization, fewer idle GPUs, and the ability to handle spikes without overprovisioning
This is especially valuable for customers with mixed workloads: coding, chat, RAG, batch, and long-form generation. Together AI allows customers to update capacity on their endpoints via a simple API call.
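The rebalancing policy itself can be simple. Here is an illustrative sketch of the kind of decision loop involved; `get_concurrency`, `set_replicas`, and the replica budget are hypothetical placeholders for your own monitoring and control-plane calls, not the Together AI API:

```python
# Illustrative rebalancing loop: assign replicas to endpoints in proportion to
# their current load. All names and numbers here are hypothetical examples.

TOTAL_REPLICAS = 16   # fixed GPU budget shared across endpoints

def rebalance(endpoints, get_concurrency, set_replicas):
    load = {ep: get_concurrency(ep) for ep in endpoints}
    total_load = sum(load.values()) or 1

    for ep in endpoints:
        # Proportional share of the fleet, with a floor of one warm replica.
        # (A real policy would also smooth over time and handle rounding so
        # the total never exceeds the budget.)
        desired = max(1, round(TOTAL_REPLICAS * load[ep] / total_load))
        set_replicas(ep, desired)
```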
What you can do with better inference optimization
Teams that implement these optimizations unlock:
- Lower TTFT and faster decoding
- Higher GPU utilization and fewer idle cycles
- Reduced cost per token
- Improved predictability and tail latencies
- Better user experience across interactive and real-time products
Getting started
Here’s a practical, low-friction way to begin:
- Measure your baseline (TTFT, decode TPS, TPM/GPU, network RTT); a measurement sketch follows this list
- Deploy a regional proxy if requests originate far from your inference cluster
- Enable adaptive continuous batching and monitor GPU utilization
- Turn on MTP or speculative decoding depending on workload
- Rebalance endpoints by dynamically shifting GPU capacity as traffic changes
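For the baseline measurement in the first step, here is a minimal sketch against a streaming, OpenAI-compatible chat endpoint. The URL, model name, and the chunk-equals-token shortcut are assumptions to adapt to your own setup:

```python
import time
import requests  # assumes the `requests` package and a streaming endpoint

URL = "https://api.example.com/v1/chat/completions"    # placeholder URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
BODY = {
    "model": "your-model-name",                         # placeholder model
    "messages": [{"role": "user", "content": "Explain KV caching briefly."}],
    "stream": True,
    "max_tokens": 256,
}

start = time.perf_counter()
ttft, chunks = None, 0
with requests.post(URL, headers=HEADERS, json=BODY, stream=True) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        if ttft is None:
            ttft = time.perf_counter() - start          # time to first token
        chunks += 1
elapsed = time.perf_counter() - start

if ttft is not None and chunks > 1:
    # Rough approximation: treat one streamed chunk as one token.
    print(f"TTFT ~{ttft * 1000:.0f} ms, "
          f"decode ~{(chunks - 1) / (elapsed - ttft):.1f} tokens/s")
```

Run it a few dozen times across the day and keep the percentiles, not just the average; tail latency is usually where the surprises live.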
FAQ
Do throughput optimizations increase latency?
Not when done well. Continuous batching and fused kernels let you raise throughput while holding or even lowering latency.
Is NVIDIA Blackwell only for huge workloads?
No. Any workload with meaningful concurrency or long contexts benefits from its bandwidth and memory improvements.
How can I tell if my GPUs are under-utilized?
Look for low decode TPS, small active batch sizes, or long gaps between token generations.