
How Decagon Engineered Sub-Second Voice AI with Together AI

  • ~6×

    cost reduction per turn compared to GPT-5 mini

  • <400ms

    p95 model latency per turn on inputs up to tens of thousands of tokens

  • Weekly

    model deployment velocity for rapid iteration


Executive Summary

Decagon builds conversational AI agents for concierge customer experiences, with voice as a core product surface. Voice raises the engineering bar significantly: in a voice interface, silence reads as failure. Long pauses cause users to talk over the agent, hang up, or lose trust immediately.

To meet the strict latency budgets required for natural conversation, Decagon partnered with Together AI, the AI Native Cloud, to run and optimize production inference for its multi-model voice stack. This enabled Decagon to keep internal focus on model quality and orchestration while Together handled hosting, hardware access, and latency optimization.

About Decagon

Decagon builds conversational AI agents that empower businesses to deliver an AI concierge for every customer, 24/7, across voice, chat, and email. The platform’s signature Agent Operating Procedures (AOPs) help enterprises like Avis Budget Group, Chime, Oura Health, and Hunter Douglas map complex business workflows into natural language instructions, executed by a mesh of specialized models trained and fine-tuned in-house.

This approach is validated at scale: Decagon’s AI agents handled tens of millions of concierge interactions last year alone, with an industry-leading average deflection rate of more than 80%.


The challenge

Voice latency is audible

Decagon’s leadership frames voice as the most demanding surface because latency is immediately perceptible. Long pauses create awkward moments: users talk over the agent or hang up assuming the system is broken. Decagon requires world-class latency, even when processing long conversational contexts spanning thousands of tokens.

Multi-model architecture increases serving complexity

Decagon’s agent stack orchestrates multiple specialized models in real time, many of them trained in-house for specific tasks. This architecture increases capability, but it creates a more complex critical path where tail latency or instability in any one component can degrade the overall experience.

Guardrails must run without adding silence

Enterprise deployments require strict adherence to corporate and regulatory protocols. Decagon’s research team carefully sequenced fine-tuned checker models to vet outputs without introducing additional delay.

Economics of always-on voice

Decagon needed economics that support 24/7 voice deployments without relying on large closed-source models for every step in the pipeline. Their approach favored smaller, fine-tuned models plus serving efficiency at scale to reduce cost per turn while preserving quality.

Production iteration requires rapid deploy-test loops

Decagon’s research team cycles through checkpoints frequently. The deployment path has to support pushing new variants into production-like conditions quickly so iteration doesn’t bottleneck on inference operations.


The solution

Decagon partnered with Together AI to run production inference for its multi-model voice stack. The collaboration focused on keeping the serving layer stable at voice-speed latency budgets and decoupling research iteration from infrastructure complexity.

To support the throughput and latency demands of real-time voice, the teams moved the majority of workloads onto NVIDIA HGX B200. This increased available compute headroom and reduced the risk that more complex model variants would push the system over strict latency thresholds. Decagon uses Together’s inference engine as the execution layer and describes Together’s end-to-end latency as leading the other options it evaluated for voice workloads.

A central performance lever is speculative decoding. Decagon trains smaller draft models (“speculators”) that propose tokens ahead of the main model, with the larger model verifying those drafts. This allows the agent to begin speaking sooner while keeping output quality aligned to the main model. Decagon worked directly with Together to train custom speculators for their applications and credits speculative decoding as a primary driver of end-to-end latency improvements.
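To make the mechanism concrete, here is a minimal, self-contained sketch of the draft-and-verify loop that speculative decoding relies on. The `draft_next` and `target_next` callables are toy stand-ins for a speculator and a main model (assumptions for illustration only); this is not Decagon’s or Together’s implementation, which runs inside the serving engine and verifies an entire draft in a single forward pass of the large model.

```python
# Toy sketch of speculative decoding: a small "speculator" drafts a short run
# of tokens, and the large model keeps only the prefix it agrees with.
# draft_next / target_next are hypothetical stand-ins, not a real API.
from typing import Callable, List

def speculative_decode(
    draft_next: Callable[[List[str]], str],   # small, fast speculator model
    target_next: Callable[[List[str]], str],  # large model (quality reference)
    prompt: List[str],
    draft_len: int = 4,
    max_tokens: int = 32,
) -> List[str]:
    output = list(prompt)
    while len(output) - len(prompt) < max_tokens:
        # 1. The speculator cheaply drafts a short continuation.
        drafted: List[str] = []
        ctx = list(output)
        for _ in range(draft_len):
            tok = draft_next(ctx)
            drafted.append(tok)
            ctx.append(tok)

        # 2. The large model verifies the draft (token by token here for
        #    clarity; real engines verify the whole draft in one pass).
        for tok in drafted:
            verified = target_next(output)
            if verified == tok:
                output.append(tok)        # accepted: draft matched the target
            else:
                output.append(verified)   # rejected: keep the target's token
                break                     # discard the rest of the draft
    return output[len(prompt) : len(prompt) + max_tokens]
```

Because accepted draft tokens cost far less than full large-model decode steps, a well-matched speculator lets the agent start speaking sooner while the output remains what the main model would have produced.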

The stack also incorporates prompt caching and request-level optimizations to reduce repeated computation across multi-turn conversations and lower per-request overhead under load. Together helped tune deployment configurations for traffic volatility, including scaling ahead of known peaks and reducing startup time during unexpected surges through image and deployment optimization. Decagon also references a major AWS outage as a real-world surge event where traffic spiked as users redirected from unavailable services.
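As a rough illustration of the prompt-caching idea, the sketch below keys cached state on a hash of the conversation prefix so that repeated turns in the same conversation can reuse earlier work. The class and its interface are assumptions for illustration; in production, prompt caching typically lives inside the inference engine (reusing KV-cache state across turns) rather than in application code.

```python
# Hypothetical prefix cache: multi-turn conversations share long prefixes,
# so work done for an earlier prefix can be reused on the next turn.
import hashlib
from typing import Dict, List, Optional, Tuple

class PrefixCache:
    def __init__(self) -> None:
        self._store: Dict[str, object] = {}

    @staticmethod
    def _key(messages: List[str]) -> str:
        # Hash the joined prefix so identical prefixes map to one entry.
        return hashlib.sha256("\x1e".join(messages).encode()).hexdigest()

    def lookup(self, messages: List[str]) -> Tuple[Optional[object], int]:
        """Return cached state for the longest cached prefix and its length."""
        for cut in range(len(messages), 0, -1):
            state = self._store.get(self._key(messages[:cut]))
            if state is not None:
                return state, cut
        return None, 0

    def store(self, messages: List[str], state: object) -> None:
        self._store[self._key(messages)] = state
```

On each new turn, the serving layer can look up the longest cached prefix, recompute only the new suffix, and store the updated state for the next turn, which is where the per-request savings on long conversations come from.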


Results

Achieved significant cost reduction

Decagon achieved a nearly 6× cost reduction per turn compared to closed models like GPT-5 mini. This economic improvement made 24/7 voice deployments viable at scale.

Met production latency and throughput targets

Decagon reduced p95 model latency per turn from seconds to <400ms on inputs up to tens of thousands of tokens. The production stack uses a suite of fine-tuned open-source models for core reasoning, served on NVIDIA Blackwell GPUs with high tensor parallelism, alongside speculative decoding with custom speculators and prompt caching.
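For teams that want to sanity-check this kind of number against their own stack, a minimal measurement harness might look like the sketch below; `call_model` is a placeholder for whatever per-turn inference call is being timed and is not Decagon’s code.

```python
# Assumed measurement harness: time each conversational turn and report p95.
import statistics
import time
from typing import Callable, List

def p95_latency(call_model: Callable[[str], str], prompts: List[str]) -> float:
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        call_model(prompt)                      # one conversational turn
        latencies.append(time.perf_counter() - start)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(latencies, n=100)[94]
```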

Maintained stability under traffic surges

Decagon routinely handles spikes caused by outages; for example, the team maintained production stability during a major AWS outage in US-East that caused traffic to Decagon’s services to spike as users redirected from unavailable services.

Enabled high-velocity model iteration

Decagon ships models weekly, sometimes daily. Together's infrastructure allows the research team to deploy, test, and roll forward new checkpoints in production-like conditions rapidly.

Delivered a conversational voice experience

Customers describe the voice experience as helpful and conversational, with some explicitly thanking the agent after calls—qualitative feedback that reflects the latency and quality standards the infrastructure enables.

Use case details

Use case: Conversational AI agents for customer experience

Company segment: Application Builders
