Dedicated Container Inference
GPU infrastructure purpose-built for generative media workloads. Deploy video, audio, and avatar generation models on the AI Native Cloud.
Production infrastructure engineered for generative media workloads
Dedicated Container Inference handles the complexity of deploying video, audio, and avatar generation models at scale.
Predictable, efficient pricing
Best unit economics for generative media workloads. Only pay for the capacity you use.
Research-backed speed
Hands-on partnership to profile and optimize your models. Achieve a 60% speedup for production video generation workloads.
Made for massive surges
Proven elastic autoscaling during viral moments. Priority-based queuing ensures paying customers never wait, even when free trial traffic spikes.
Job-level monitoring
Real-time visibility into inference jobs, GPU utilization, queue depth, and latency metrics. Full observability for production debugging.
Multi-cluster scaling
Rapid autoscaling across clusters handles 10x traffic surges with zero failures, distributing workloads automatically.
Multi-GPU orchestration
Deploy video generation models across multiple GPUs with a simplified torchrun interface. No manual subprocess management required.
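For illustration, here is a minimal sketch of what a torchrun-launched multi-GPU inference worker looks like in plain PyTorch; the DummyGenerator model and the prompt sharding are placeholders, and the simplified interface described above abstracts this launch step away.

```python
# Sketch: a worker script launched with
#   torchrun --nproc_per_node=4 worker.py
# DummyGenerator stands in for a real video/audio model; the distributed
# plumbing (init_process_group, LOCAL_RANK, rank-based sharding) is standard PyTorch.
import os
import torch
import torch.distributed as dist


class DummyGenerator(torch.nn.Module):
    """Stand-in for a generative media model."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 64)

    @torch.no_grad()
    def generate(self, prompt: str) -> torch.Tensor:
        # Pretend the prompt conditions a latent; return a fake "frame" tensor.
        latent = torch.randn(1, 64, device=self.proj.weight.device)
        return self.proj(latent)


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DummyGenerator().to(f"cuda:{local_rank}")

    # Simple data parallelism: each rank takes every Nth prompt.
    prompts = [
        "drone shot over a coastline",
        "timelapse of a city at night",
        "slow pan across a forest",
        "close-up of rain on glass",
    ]
    for i, prompt in enumerate(prompts):
        if i % dist.get_world_size() == dist.get_rank():
            frame = model.generate(prompt)
            print(f"rank {dist.get_rank()} generated output for {prompt!r}, "
                  f"shape={tuple(frame.shape)}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```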
Why choose Dedicated Container Inference
Build and scale generative media workloads with infrastructure engineered for performance, reliability, and profitability.
Powered by leading research
Achieved a 2.6x speedup for production video generation models through profiling and optimization.
Generate video at a real-time factor (RTF) of 10.
Unbeatable unit economics
Leading performance drives lower GPU pricing.
Rapid GPU autoscaling ensures you only pay for what you use.
No artifact transfer costs: Train on Together and deploy instantly to Dedicated Container Inference.
Proven at production scale
Multi-cluster orchestration and automatic workload distribution.
Rapid autoscaling supports 10x+ traffic surges.
Pick the deployment that fits your needs
Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.
Serverless Inference
Deploy models instantly with a fully managed API—no infrastructure setup required—and pay per token.
Real-time
A fully managed inference API that automatically scales with request volume.
Best for:
- Variable or unpredictable traffic
- Rapid prototyping and iteration
- Cost-sensitive or early-stage production workloads
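As a rough sketch, a real-time serverless request can look like the following OpenAI-compatible chat completions call; the endpoint path, model name, and response fields here are assumptions to verify against the current API reference.

```python
# Sketch of a pay-per-token serverless request.
# Endpoint URL, model id, and response shape are assumptions, not guaranteed.
import os
import requests

API_URL = "https://api.together.xyz/v1/chat/completions"  # assumed endpoint
headers = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

payload = {
    "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative model id
    "messages": [
        {"role": "user", "content": "Summarize what serverless inference means."}
    ],
    "max_tokens": 128,
}

resp = requests.post(API_URL, headers=headers, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```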
Batch
Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% lower cost.
Best for:
- Classifying large datasets
- Offline summarization
- Synthetic data generation
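A batch job is typically driven by a file of pre-built requests. The sketch below writes such a JSONL file for an offline summarization run, assuming an OpenAI-style per-line schema (custom_id plus request body); the exact schema and upload mechanism should be checked against the platform's batch documentation.

```python
# Sketch: prepare a JSONL file of requests for asynchronous batch processing.
# The per-line schema and the model id are illustrative assumptions.
import json

documents = [
    "Quarterly report text ...",
    "Support ticket transcript ...",
]

with open("batch_requests.jsonl", "w") as f:
    for i, doc in enumerate(documents):
        request = {
            "custom_id": f"doc-{i}",
            "body": {
                "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative
                "messages": [
                    {"role": "system", "content": "Summarize the document in two sentences."},
                    {"role": "user", "content": doc},
                ],
                "max_tokens": 120,
            },
        }
        f.write(json.dumps(request) + "\n")
```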
Dedicated Inference
Reserved compute resources for your inference workloads with predictable, best-in-class performance and cost.
Dedicated Endpoints
An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine.
Best for:
- Predictable or steady traffic
- Latency-sensitive applications
- High-throughput production workloads
Dedicated Container Inference
Run inference with your own engine and model on fully managed, scalable infrastructure.
Best for:
- Generative media models
- Non-standard runtimes
- Custom inference pipelines
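As an illustration of a custom inference pipeline, the sketch below is a minimal HTTP server one might package into a container image; the /healthz and /generate routes, request fields, and output URI are assumptions for illustration, not the platform's actual container contract.

```python
# Sketch: a minimal custom inference server to package into a container.
# Routes, payload fields, and the returned artifact URI are illustrative only.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_pipeline(prompt: str) -> dict:
    """Placeholder for a custom engine call (e.g. a diffusion or TTS pipeline)."""
    return {"prompt": prompt, "artifact_uri": "s3://bucket/output.mp4"}  # illustrative


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self._reply(200, {"status": "ok"})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        if self.path == "/generate":
            length = int(self.headers.get("Content-Length", 0))
            body = json.loads(self.rfile.read(length) or b"{}")
            self._reply(200, run_pipeline(body.get("prompt", "")))
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code: int, payload: dict):
        data = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```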
Our content & guides
Learn more about Dedicated Container Inference and start building by following our guides.