Dedicated Model Inference

Deploy models on dedicated infrastructure, engineered for speed

Purpose-built for teams who need control and the best economics in the market.

Why Dedicated Inference with Together AI?

Designed for production workloads that need consistent performance and operational control.

Built for production inference

Scale to thousands of GPUs for always-on, production inference deployments.

Industry-leading unit economics

We provide the fastest deployments, enabling the best price-performance on top GPUs.

Powered by frontier AI systems research

We continuously roll out the latest innovations to keep your deployments running fast.

Build with leading models

Explore top-performing models across text, image, video, code, and voice.

  • Nano Banana Pro (Gemini 3 Pro Image) (Image, new)
  • GLM-5 (Chat, new)
  • Kimi K2.5 (Chat, new)
  • gpt-oss-120B (Chat, new)
  • DeepSeek-V3.2-Exp (Code, new)
  • Ministral 3 8B Instruct 2512 (Chat, new)
  • MiniMax Speech 2.6 Turbo (Audio, new)
  • LFM2 24B A2B (Chat, new)
  • Qwen3-Coder-Next (Code, new)
  • Wan 2.6 Image (Image, new)
  • GPT Image 1.5 (Image, new)
  • Gemma 3 27B (Chat)
  • Llama 4 Maverick (Chat)
  • Qwen3 235B A22B Instruct 2507 FP8 (Chat)
  • Google Veo 3.0 (Video)
  • FLUX.2 [pro] (Image)
  • NIM Llama 3.1 Nemotron 70B Instruct (Chat)
  • Kimi K2 Instruct-0905 (Chat, new)
  • Sora 2 Pro (Video, new)
  • Arcee AI Trinity Mini (Chat, new)

Have your own model?

Deploy custom containers on Together’s managed GPU infrastructure with automatic scaling, job queues, and built-in observability.

Key capabilities, purpose-built for AI natives

Scale from self-serve instant clusters to thousands of GPUs, all optimized for better performance with Together Kernel Collection.

    • Adaptive speculative decoding

      Faster Outputs
      Learns in production
      Lossless quality

      Cut latency on dedicated infrastructure with ATLAS, Together's AdapTive-LeArning Speculator System. Predict and validate multiple tokens per step to accelerate workloads continuously. No decoding bottlenecks.

    • Deploy in minutes

      NO DEVOPS REQUIRED
      LIVE IN MINUTES
      SIMPLE CONFIGURATION

      Launch dedicated endpoints in minutes by selecting a target model and hardware configuration. Establish production-ready inference environments without deep infrastructure expertise; a request sketch follows this list.

    • Bring your own language model

      BRING ANY MODEL
      DEPLOY IN MINUTES
      UI OR CLI

      Deploy custom models directly from Hugging Face or S3 onto dedicated endpoints via the UI or CLI. Maintain complete ownership while offloading infrastructure management.
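
As an illustrative follow-up to the capabilities above, the sketch below sends a request to a live dedicated endpoint, assuming it is exposed through Together's OpenAI-compatible chat completions API. The model name and prompt are placeholders; substitute the endpoint name shown in your dashboard once the deployment is running.

    import os
    import requests

    # Placeholder name for a dedicated endpoint; use the model/endpoint name
    # shown in the Together dashboard after the deployment goes live.
    MODEL = "your-org/your-dedicated-endpoint"

    response = requests.post(
        "https://api.together.xyz/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Summarize this support ticket."}],
            "max_tokens": 256,
        },
        timeout=60,
    )
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])
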

Research that ships

Our research team doesn't just publish. They build the optimizations that power every inference request.

ATLAS: 3.18x faster

(Chart: performance on DeepSeek V3.1 (Arena Hard), comparing ATLAS, a static speculator, and no speculator)

ATLAS, our AdapTive-LeArning Speculator System, continuously learns from live traffic, outperforming static speculators and specialized hardware.
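
To make the mechanism concrete, here is a minimal sketch of the draft-and-verify loop that speculative decoding relies on; it illustrates the general technique, not ATLAS itself. A cheap draft model proposes several tokens, the target model checks them in a single pass, and only agreed-upon tokens are kept, so greedy output is unchanged. The draft_next_tokens and target_argmax callables are hypothetical stand-ins for real model calls.

    from typing import Callable, List

    def speculative_step(
        prefix: List[int],
        draft_next_tokens: Callable[[List[int], int], List[int]],  # hypothetical draft-model call
        target_argmax: Callable[[List[int]], List[int]],           # hypothetical target-model call
        k: int = 4,
    ) -> List[int]:
        """One draft-and-verify step (greedy case); assumes a non-empty prefix."""
        proposed = draft_next_tokens(prefix, k)  # cheap model proposes k tokens
        # One target-model pass: target_choices[j] is the greedy next token after tokens[: j + 1].
        target_choices = target_argmax(prefix + proposed)
        accepted: List[int] = []
        for i, tok in enumerate(proposed):
            target_tok = target_choices[len(prefix) + i - 1]
            if target_tok == tok:
                accepted.append(tok)          # target agrees: keep the drafted token
            else:
                accepted.append(target_tok)   # mismatch: take the target's token and stop
                break
        return prefix + accepted
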
CPD: +40% throughput

(Chart: Together AI CPD vs. 2P1D; CPD improves sustainable QPS by 35-40%)

Long-context inference without the latency penalty. CPD (cache-aware prefill-decode disaggregation) separates warm and cold requests, cutting time-to-first-token and boosting throughput by up to 40%.
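
As a rough illustration of the routing idea, this toy sketch (simplified assumptions, not Together's scheduler) sends warm requests whose prefix KV cache already lives on a decode worker straight there, and sends cold requests to prefill workers so their long prefills don't stall latency-sensitive traffic. The worker pools and prefix hashing are invented for the example.

    import hashlib
    from typing import Dict

    # Illustrative worker pools; a real scheduler also tracks load and cache occupancy per GPU.
    PREFILL_WORKERS = ["prefill-0", "prefill-1"]
    DECODE_WORKERS = ["decode-0", "decode-1", "decode-2"]

    kv_cache_index: Dict[str, str] = {}  # prefix hash -> decode worker holding that KV cache

    def prefix_key(prompt: str, prefix_chars: int = 512) -> str:
        """Hash the leading chunk of the prompt so shared system prompts map to the same key."""
        return hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()

    def route(prompt: str) -> str:
        """Return the worker that should handle this request."""
        key = prefix_key(prompt)
        if key in kv_cache_index:
            # Warm request: its prefix KV cache already lives on a decode worker.
            return kv_cache_index[key]
        # Cold request: run the long prefill on a prefill worker and pin the
        # resulting cache to a decode worker chosen by the hash.
        kv_cache_index[key] = DECODE_WORKERS[int(key, 16) % len(DECODE_WORKERS)]
        return PREFILL_WORKERS[int(key, 16) % len(PREFILL_WORKERS)]
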
Megakernel: up to 3.6x faster output speed

(Chart: Megakernel on H100 vs. baseline on B200)

Megakernel fuses an entire model's forward pass into a single GPU kernel. Made using the ThunderKittens framework, Megakernel eliminates the idle gaps between operations that rob GPUs of their full potential.
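
For a small-scale analogue of the underlying idea (ordinary PyTorch, not Megakernel or ThunderKittens code): in eager mode each op below launches its own GPU kernel with idle gaps in between, while torch.compile can fuse the elementwise chain into one generated kernel. A megakernel pushes this to the extreme by keeping the whole forward pass resident in a single kernel.

    import torch

    def mlp_tail(x: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # In eager mode the matmul, bias add, GELU, and scale each launch a separate
        # GPU kernel, leaving small idle gaps between them.
        h = x @ w
        h = h + b
        h = torch.nn.functional.gelu(h)
        return h * 0.5

    # torch.compile traces the function and fuses the elementwise chain into one generated kernel.
    fused_tail = torch.compile(mlp_tail)

    if torch.cuda.is_available():
        x = torch.randn(32, 4096, device="cuda", dtype=torch.bfloat16)
        w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
        b = torch.randn(4096, device="cuda", dtype=torch.bfloat16)
        out = fused_tail(x, w, b)  # first call compiles; later calls reuse the fused kernel
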
ParallelKittens: up to 1.79x faster than NCCL

(Chart: BF16 all-reduce sum performance on 8x NVIDIA B200s, ParallelKittens vs. NCCL)

ParallelKittens, an extension to ThunderKittens for multi-GPU workloads developed in collaboration with Stanford's Hazy Research lab, cuts the synchronization overhead that large multi-GPU models pay on every forward pass.
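
For reference, this is the collective the benchmark above measures: a BF16 all-reduce sum across the GPUs on one node, issued here through torch.distributed's standard NCCL path rather than ParallelKittens. The tensor size and script name are arbitrary; launch with torchrun, one process per GPU.

    import os

    import torch
    import torch.distributed as dist

    def main() -> None:
        # Launch with one process per GPU, e.g.:
        #   torchrun --nproc_per_node=8 allreduce_demo.py
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        torch.cuda.set_device(local_rank)

        # BF16 buffer summed across all ranks (size chosen arbitrarily).
        x = torch.full((64 * 1024 * 1024,), float(dist.get_rank()),
                       device="cuda", dtype=torch.bfloat16)

        dist.all_reduce(x, op=dist.ReduceOp.SUM)  # every rank receives the elementwise sum
        torch.cuda.synchronize()

        if dist.get_rank() == 0:
            print("all-reduce complete; first element:", x[0].item())
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()
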

Deployment options

Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.


Serverless Inference

Real-time

A fully managed inference API that automatically scales with request volume.

Best for:
  • Variable or unpredictable traffic
  • Rapid prototyping and iteration
  • Cost-sensitive or early-stage production workloads

Batch

Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% less cost.

Best for:
  • Classifying large datasets
  • Offline summarization
  • Synthetic data generation

Dedicated Inference

Dedicated Model Inference

An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine.

Best for:
  • Predictable or steady traffic
  • Latency-sensitive applications
  • High-throughput production workloads

Dedicated Container Inference

Run inference with your own engine and model on fully managed, scalable infrastructure.

Best for:
  • Generative media models
  • Non-standard runtimes
  • Custom inference pipelines

Production-grade security and data privacy

We take security and compliance seriously, with strict data privacy controls to keep your information protected. Your data and models remain fully under your ownership, safeguarded by robust security measures.


  • NVIDIA preferred partner
  • AICPA SOC 2 Type II

Customers running inference in production

  • cost reduction
  • <400ms p95 model latency
  • Weekly model deployments

"Low latency is especially important for voice because there’s a much higher UX bar. Together helped us push latency down by optimizing our models with techniques like speculative decoding, and they’ve been a reliable production partner — proactive about risks and fast when issues come up."

Max Lu

Head of Research, Decagon

  • ~30% cost savings

"Together has helped us deploy VyUI, our state-of-the-art computer AI model. We had multiple in-depth meetings where we brainstormed how we could satisfy our model's custom technical requirements while still leveraging Together's infrastructure for efficient, load-balanced inference."

Luca Weihs

Co-founder, Vercept

    "Together AI offers optimized performance at scale, and at a lower cost than closed-source providers – all while maintaining strict privacy standards."

    Vineet Khosla

    CTO, The Washington Post