Dedicated Model Inference

Deploy models on dedicated infrastructure, engineered for speed

Purpose-built for teams who need control and the best economics in the market.

Why Dedicated Inference with Together AI?

Designed for production workloads that need consistent performance and operational control.

Built for production inference

Scale to thousands of GPUs for always-on, production inference deployments.

Industry-leading unit economics

We provide the fastest deployments, enabling the best price-performance on top GPUs.

Powered by frontier AI systems research

We continuously roll out the latest innovations to keep your deployments running fast.

Research that ships

Our research team doesn't just publish. They build the optimizations that power every inference request.

  • ATLAS: 3.18x faster

    [Chart: performance on DeepSeek V3.1 (Arena Hard), comparing ATLAS, a static speculator, and no speculator]

    ATLAS, our AdapTive-LeArning Speculator System, continuously learns from live traffic — outperforming static speculators and specialized hardware.
  • CPD: 35-40% higher sustainable QPS

    [Chart: sustainable QPS, Together AI CPD vs. a 2P1D baseline]

    Long-context inference without the latency penalty. CPD (cache-aware prefill-decode disaggregation) separates warm and cold requests, cutting time-to-first-token and boosting throughput by up to 40%.
  • Megakernel: up to 3.6x faster

    [Chart: time to first 64 tokens, Megakernel (H100) vs. baseline (B200)]

    Megakernel fuses an entire model's forward pass into a single GPU kernel. Built with the ThunderKittens framework, Megakernel eliminates the idle gaps between operations that rob GPUs of their full potential.
  • ParallelKittens: up to 1.79x faster

    [Chart: BF16 all-reduce sum performance on 8x NVIDIA B200s, ParallelKittens (PK) vs. NCCL]

    ParallelKittens—an extension to ThunderKittens for multi-GPU workloads, developed in collaboration with Stanford's Hazy Research lab—cuts the synchronization overhead that large multi-GPU models pay on every single forward pass.
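The speculative-decoding loop that an adaptive speculator like ATLAS builds on can be sketched in miniature: a cheap draft model proposes a few tokens per step, and the target model verifies them in one pass, keeping the longest agreeing prefix. The "models" below are deterministic integer toys of our own invention, not Together AI's implementation or API.

```python
# Toy sketch of speculative decoding. A cheap draft model proposes k
# tokens; the target model verifies them in one pass and keeps the
# longest agreeing prefix, plus one corrected token on a mismatch.

def target_next(seq):
    # Stand-in for the large target model: next token is last + 1 mod 10.
    return (seq[-1] + 1) % 10

def draft_next(seq):
    # Stand-in for the speculator: agrees with the target except when the
    # current length is a multiple of 3, where it drifts by one.
    guess = target_next(seq)
    return (guess + 1) % 10 if len(seq) % 3 == 0 else guess

def speculative_decode(seq, steps, k=4):
    """Generate `steps` tokens; count how many target passes were needed."""
    seq = list(seq)
    target_passes = 0
    produced = 0
    while produced < steps:
        # Draft proposes k tokens autoregressively (cheap).
        proposal, work = [], list(seq)
        for _ in range(k):
            proposal.append(draft_next(work))
            work.append(proposal[-1])
        # One target pass scores every proposed position at once.
        target_passes += 1
        for tok in proposal:
            if tok == target_next(seq) and produced < steps:
                seq.append(tok)            # accepted draft token
                produced += 1
            else:
                break
        else:
            continue  # all k accepted; no correction token this step
        if produced < steps:
            seq.append(target_next(seq))   # target's corrected token, free
            produced += 1
    return seq, target_passes
```

With the toy speculator above, generating 8 tokens takes 3 target passes instead of 8; the better the speculator tracks the target, the fewer passes are needed, which is why adapting it to live traffic pays off.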
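The warm/cold split behind CPD can be illustrated with a minimal router: requests whose prompt prefix is already resident in a KV cache go straight to decode capacity, while cold prompts are sent to a separate prefill pool. The pool names, threshold, and string-based prefix cache below are illustrative assumptions, not Together AI's scheduler.

```python
# Minimal sketch of cache-aware prefill/decode routing in the spirit of
# CPD. A "warm" request (cached prefix covers most of the prompt) skips
# heavy prefill work; a "cold" one goes to the prefill pool first.

class CacheAwareRouter:
    def __init__(self, warm_fraction=0.8):
        self.prefix_cache = set()        # prefixes with resident KV state
        self.warm_fraction = warm_fraction

    def _cached_prefix_len(self, prompt):
        # Longest prefix of the prompt whose KV cache is already resident.
        for i in range(len(prompt), 0, -1):
            if prompt[:i] in self.prefix_cache:
                return i
        return 0

    def route(self, prompt):
        hit = self._cached_prefix_len(prompt)
        # Remember this prompt's prefixes for future requests.
        for i in range(1, len(prompt) + 1):
            self.prefix_cache.add(prompt[:i])
        if hit >= self.warm_fraction * len(prompt):
            return "decode-pool"    # warm: low time-to-first-token
        return "prefill-pool"       # cold: full prefill pass needed
```

A repeated prompt, or one sharing a long system-prompt prefix with earlier traffic, lands on the decode pool; keeping such requests off the prefill workers is what cuts time-to-first-token.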
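The payoff of fusion can be seen even in a CPU-side toy: the unfused version writes an intermediate buffer after the first operation and re-reads it in the second, while the fused version does all the work in one pass. Real megakernels fuse at the GPU-kernel level; this is only an analogy for the memory traffic and launch gaps being removed.

```python
# Toy illustration of operation fusion. Each list comprehension stands in
# for a separate GPU kernel launch with an intermediate buffer between.

def unfused(xs):
    tmp = [x * 2 for x in xs]       # "kernel" 1: writes an intermediate
    return [t + 1 for t in tmp]     # "kernel" 2: re-reads it

def fused(xs):
    # One pass: no intermediate buffer, no gap between operations.
    return [x * 2 + 1 for x in xs]
```

Both produce identical results; fusion changes only how much memory traffic and inter-kernel idle time the computation pays.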
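The collective measured in the ParallelKittens benchmark, an all-reduce sum, leaves every rank holding the elementwise sum of all ranks' buffers. A plain-Python ring implementation (reduce-scatter, then all-gather) shows the exchange pattern; real systems run the same schedule over NVLink with GPU buffers, and that per-step synchronization is the overhead being attacked.

```python
# Pure-Python simulation of a ring all-reduce sum. After the call, every
# rank holds the elementwise sum of all ranks' buffers.

def ring_all_reduce(buffers):
    n = len(buffers)                   # number of ranks
    assert len(buffers[0]) % n == 0    # assume the buffer splits evenly
    c = len(buffers[0]) // n           # elements per chunk
    out = [list(b) for b in buffers]

    def idx(k):                        # element indices of chunk k
        return range(k * c, (k + 1) * c)

    # Reduce-scatter: after n-1 steps, rank r owns summed chunk (r+1) % n.
    for s in range(n - 1):
        for r in range(n):             # rank r sends chunk (r-s)%n to r+1
            k = (r - s) % n
            for i in idx(k):
                out[(r + 1) % n][i] += out[r][i]

    # All-gather: circulate each summed chunk until all ranks have it.
    for s in range(n - 1):
        for r in range(n):             # rank r sends chunk (r+1-s)%n to r+1
            k = (r + 1 - s) % n
            for i in idx(k):
                out[(r + 1) % n][i] = out[r][i]
    return out
```

Each rank exchanges data only with its neighbor, 2(n-1) times per reduction; a model sharded over 8 GPUs pays this on every forward pass, which is why shaving the per-step overhead compounds.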

Build with leading models

Explore top-performing models across text, image, video, code, and voice.

  • MiniMax M2.5 (Chat)
  • Kimi K2.5 (Chat, new)
  • GLM-5.1 (Chat, new)
  • Gemma 4 31B (Chat, new)
  • MiniMax M2.7 (Chat)
  • gpt-oss-120B (Chat, new)
  • LFM2 24B A2B (Chat)
  • Qwen3.5-397B-A17B (Chat)
  • GLM-5 (Chat)
  • Qwen3-Coder-Next (Chat)
  • Wan 2.6 Image (Image)
  • GPT Image 1.5 (Image, new)
  • Qwen3.5 9B (Chat)
  • Gemma 3 27B (Chat)
  • Llama 4 Maverick (Chat)
  • Qwen3 235B A22B Instruct 2507 FP8 (Chat)
  • Google Veo 3.0 (Video)
  • FLUX.2 [pro] (Image)
  • NIM Llama 3.1 Nemotron 70B Instruct (Chat)
  • Kimi K2 Instruct-0905 (Chat)