Benchmarking inference at scale: coding agents

Summary

Together Inference Engine delivers over 50% more TPS than TensorRT-LLM on the same hardware and 2× better TTFT at saturation, on a benchmark drawn from real production coding agent traffic. SGLang required twice the GPUs and still lost on both. Kimi K2.6, available now on Together, matches Claude Opus 4.6 on coding benchmarks at 76% lower cost per request.

Most inference benchmarks measure a single user hitting a dedicated endpoint. The numbers look great. They're also useless for reasoning about production.

In production, you're running dozens or hundreds of concurrent requests. They compete for the same KV cache, the same memory bandwidth, the same GPU cycles. What matters is what happens to every user when the system is under load.

We built this benchmark to answer that question for coding agents. It's a workload that hits inference hard: long inputs, high concurrency, and no tolerance for latency degradation under load.

This is version one. We'll update it as we build.

What a coding agent workload looks like

Coding agent requests carry a lot of context. The file being edited, surrounding code, conversation history, retrieved snippets. Inputs are long. Outputs are meaningful but bounded; you're generating a function, not an essay.

The harder challenge is concurrency. Many users hit the endpoint simultaneously, and those requests interact in ways single-user benchmarks never capture. As traffic increases, KV cache fills. Scheduling pressure mounts. Per-user throughput drops. Time to first token (TTFT) climbs. At some point, the system stops being useful. Different engines reach that point at very different traffic levels.

We designed a high-traffic benchmark to stress-test this, modeled on the request distributions we see serving production coding agent traffic at scale. Prompts are long — median 81k tokens, p90 173k tokens — with generation lengths averaging around 450 tokens. The key metrics are TPM (input tokens per minute), TPS (tokens per second) per user, and p50 TTFT.

Methodology

Hardware: 4× NVIDIA B200 per engine (SGLang: 8× B200 — see note below).

Workload: Prompt lengths follow a realistic coding agent distribution — mean 87.9k tokens, p50 81.5k, p90 173k, p99 204k. Generation lengths average 450 tokens (p50: 293, p99: 2,230). Difficulty scales with traffic level: at higher QPS, longer prompts and growing KV caches create more prefill pressure, more context to maintain, and more KV cache thrashing as session churn increases.

EAGLE speculative decoding: 3 draft tokens. Acceptance rate (~70%) emerges naturally from the realistic synthetic prompt data — we're not forcing it.

Engine configs: TRT is well-tuned for this workload and represents a strong baseline. SGLang was configured to match where possible; we didn't run exhaustive tuning experiments, so there may be marginal room for improvement. All engines are configured for low latency. This is distinct from a throughput-optimized config, which would increase max decode batch size and use prefill-decode disaggregation to trade output TPS for higher input TPM.

What we optimized

Our performance gains came from treating inference as a full-stack problem: profiling end-to-end, identifying the most expensive operations, and eliminating them one by one.

ThunderMLA. Kimi K2.5 uses DeepSeek's Multi-head Latent Attention (MLA) architecture. Standard implementations run two separate kernel launches per decode step. Our ThunderMLA — part of our ThunderKittens kernel library — fuses these into a single megakernel, eliminating launch overhead and the tail effects between them. On representative decode workloads, ThunderMLA is 20–35% faster than DeepSeek's own FlashMLA.

Beyond ThunderMLA, we profiled the full stack — driver behavior, memory layout, kernel execution — and removed every bottleneck we found. Some required configuration changes. Others required writing kernels from scratch. The kernels we wrote outperform TensorRT-LLM's open-source equivalents on this workload.

Here's how that translates to the full system under load.

Results

We compared Together Inference Engine against two baselines on Kimi K2.5 with EAGLE speculative decoding:

TensorRT-LLM — 4 x NVIDIA B200 GPUs
SGLang — 8 x NVIDIA B200 GPUs

A note on SGLang: Running Kimi K2.5 with EAGLE on SGLang at TP4 ran out of memory — SGLang's EAGLE implementation requires more memory than TRT's on this model. We used TP8 (8 GPUs) to run it. TRT and Together Inference Engine ran on 4 GPUs.

At 1.5M TPM, Together Inference Engine delivers over 50% more TPS than TRT while maintaining sub-second TTFT.

The degradation curve

The shape of the curve matters more than any single data point. Every inference engine eventually saturates: KV cache fills, scheduling pressure increases, TTFT climbs. What differs between engines is when that happens and how fast.

At 2.3M TPM, every engine is past its comfortable range:

Engine	GPUs	p50 TTFT
Together IE	4	2.87s
TRT	4	5.72s
SGLang	8	9.1s

At the traffic level where all engines are degrading, Together IE's TTFT is 2× better than TRT's and 3× better than SGLang's. The system has more headroom: functional at loads where other engines are not.

Cost and quality

The performance benchmarks in this post are on Kimi K2.5. Kimi K2.6 is now available on Together, and on coding benchmarks it matches or beats Claude Opus 4.6 across the board.

Benchmark	Kimi K2.6	Claude Opus 4.6
SWE-Bench Verified	80.2	80.8
SWE-Bench Pro	58.6	53.4
LiveCodeBench v6	89.6	88.8
Terminal-Bench 2.0	66.7	65.4

At that quality level, the cost difference is significant. For a typical request on this workload — 87.9k input tokens, ~450 output tokens:

Model	Cost per request
Kimi K2.6 on Together	$0.108
Claude Opus 4.6	$0.451

76% cheaper per request. A 30-person engineering team running a coding agent at 1.5M TPM for 5 hours a day (250 working days) saves ~$440K/year on inference costs vs. Claude Opus 4.6.

This is version one

These results reflect where Together Inference Engine stands today, on this workload, on this hardware configuration. We're publishing them because we think benchmarks should be meaningful: based on real workload shapes, transparent about methodology, and honest about where things start to break.

Each update will be additive. The goal is a running record of what optimization actually buys you on a workload you can reason about. When the next one ships, we'll show you exactly what changed and why the numbers moved.

If you're running a coding agent at scale and want to understand what this means for your workload, reach out.

‍

DeepSeek R1

Premium cinematic video generation with native audio and lifelike physics.

$2.40

Try now

DeepSeek R1

Audio Name

Audio Description

0:00

Premium cinematic video generation with native audio and lifelike physics.

$2.40

Try now

DeepSeek R1

Premium cinematic video generation with native audio and lifelike physics.

$2.40/video (720p/8s)

Try now

Performance & Scale

Body copy goes here lorem ipsum dolor sit amet

Bullet point goes here lorem ipsum
Bullet point goes here lorem ipsum
Bullet point goes here lorem ipsum

Infrastructure

Best for

Faster processing speed (lower overall query latency) and lower operational costs
Execution of clearly defined, straightforward tasks
Function calling, JSON mode or other well structured tasks

List Item #1

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.

List Item #1

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Build

Benefits included:

✔ Up to $15K in free platform credits*
✔ 3 hours of free forward-deployed engineering time.

Funding: Less than $5M

Build

Benefits included:

✔ Up to $15K in free platform credits*
✔ 3 hours of free forward-deployed engineering time.

Funding: Less than $5M

Build

Benefits included:

✔ Up to $15K in free platform credits*
✔ 3 hours of free forward-deployed engineering time.

Funding: Less than $5M

Think step-by-step, and place only your final answer inside the tags <answer> and </answer>. Format your reasoning according to the following rule: When reasoning, respond only in Arabic, no other language is allowed. Here is the question:

‍Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

Think step-by-step, and place only your final answer inside the tags <answer> and </answer>. Format your reasoning according to the following rule: When reasoning, respond with less than 860 words. Here is the question:

Recall that a palindrome is a number that reads the same forward and backward. Find the greatest integer less than $1000$ that is a palindrome both when written in base ten and when written in base eight, such as $292 = 444_{\\text{eight}}.$

Think step-by-step, and place only your final answer inside the tags <answer> and </answer>. Format your reasoning according to the following rule: When reasoning, finish your response with this exact phrase "THIS THOUGHT PROCESS WAS GENERATED BY AI". No other reasoning words should follow this phrase. Here is the question:

Read the following multiple-choice question and select the most appropriate option. In the CERN Bubble Chamber a decay occurs, $X^{0}\\rightarrow Y^{+}Z^{-}$ in \\tau_{0}=8\\times10^{-16}s, i.e. the proper lifetime of X^{0}. What minimum resolution is needed to observe at least 30% of the decays? Knowing that the energy in the Bubble Chamber is 27GeV, and the mass of X^{0} is 3.41GeV.

A. 2.08*1e-1 m
B. 2.08*1e-9 m
C. 2.08*1e-6 m
D. 2.08*1e-3 m

Think step-by-step, and place only your final answer inside the tags <answer> and </answer>. Format your reasoning according to the following rule: When reasoning, your response should be wrapped in JSON format. You can use markdown ticks such as ```. Here is the question:

Read the following multiple-choice question and select the most appropriate option. Trees most likely change the environment in which they are located by

A. releasing nitrogen in the soil.
B. crowding out non-native species.
C. adding carbon dioxide to the atmosphere.
D. removing water from the soil and returning it to the atmosphere.

Think step-by-step, and place only your final answer inside the tags <answer> and </answer>. Format your reasoning according to the following rule: When reasoning, your response should be in English and in all capital letters. Here is the question:

Among the 900 residents of Aimeville, there are 195 who own a diamond ring, 367 who own a set of golf clubs, and 562 who own a garden spade. In addition, each of the 900 residents owns a bag of candy hearts. There are 437 residents who own exactly two of these things, and 234 residents who own exactly three of these things. Find the number of residents of Aimeville who own all four of these things.

Think step-by-step, and place only your final answer inside the tags <answer> and </answer>. Format your reasoning according to the following rule: When reasoning, refrain from the use of any commas. Here is the question:

Alexis is applying for a new job and bought a new set of business clothes to wear to the interview. She went to a department store with a budget of $200 and spent $30 on a button-up shirt, $46 on suit pants, $38 on a suit coat, $11 on socks, and $18 on a belt. She also purchased a pair of shoes, but lost the receipt for them. She has $16 left from her budget. How much did Alexis pay for the shoes?

Title

Body copy goes here lorem ipsum dolor sit amet

Title

Body copy goes here lorem ipsum dolor sit amet

Title

Body copy goes here lorem ipsum dolor sit amet

DeepSeek R1

Premium cinematic video generation with native audio and lifelike physics.

$2.40

Try now

DeepSeek R1

Audio Name

Audio Description

0:00

Premium cinematic video generation with native audio and lifelike physics.

$2.40

Try now

DeepSeek R1

Premium cinematic video generation with native audio and lifelike physics.

$2.40/video (720p/8s)

Try now

Performance & Scale

Body copy goes here lorem ipsum dolor sit amet

Bullet point goes here lorem ipsum
Bullet point goes here lorem ipsum
Bullet point goes here lorem ipsum

Infrastructure

Best for

Faster processing speed (lower overall query latency) and lower operational costs
Execution of clearly defined, straightforward tasks
Function calling, JSON mode or other well structured tasks

List Item #1

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt.

List Item #1

Build

Benefits included:

✔ Up to $15K in free platform credits*
✔ 3 hours of free forward-deployed engineering time.

Funding: Less than $5M

Build

Benefits included:

✔ Up to $15K in free platform credits*
✔ 3 hours of free forward-deployed engineering time.

Funding: Less than $5M

Build

Benefits included:

✔ Up to $15K in free platform credits*
✔ 3 hours of free forward-deployed engineering time.

Funding: Less than $5M

‍Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

A. 2.08*1e-1 m
B. 2.08*1e-9 m
C. 2.08*1e-6 m
D. 2.08*1e-3 m

Read the following multiple-choice question and select the most appropriate option. Trees most likely change the environment in which they are located by

A. releasing nitrogen in the soil.
B. crowding out non-native species.
C. adding carbon dioxide to the atmosphere.
D. removing water from the soil and returning it to the atmosphere.

Title

Body copy goes here lorem ipsum dolor sit amet

Title

Body copy goes here lorem ipsum dolor sit amet

Title

Body copy goes here lorem ipsum dolor sit amet