Dedicated Container Inference
GPU infrastructure purpose-built for generative media workloads
Deploy video, audio, and avatar generation models on the AI Native Cloud.

Why Dedicated Container Inference with Together AI?
Designed for production workloads that need consistent performance and operational control.
Predictable pricing, leading unit economics
Leading performance lowers your effective GPU cost. Rapid autoscaling ensures you only pay for the capacity you use. Train on Together and deploy to dedicated containers without artifact transfer fees.
Research-backed speed
Hands-on partnership to profile and optimize your models. Achieve up to 2.6x speedup for production video generation workloads.
Made for massive surges
Proven elastic autoscaling during viral moments supports 10x+ traffic surges. Priority-based queuing ensures paying customers never wait, even when free-trial traffic spikes.
Key capabilities, purpose-built for AI natives
From hardware to inference stack, every capability is optimized to get more out of every request.
Monitor inference jobs, GPU utilization, queue depth, and latency metrics in real time. Full observability for production debugging.
Autoscale rapidly to handle 10x traffic surges with zero failures. Keep critical workloads running via priority-based queuing.
Deploy video generation models across multiple GPUs with a simplified torchrun interface. No manual subprocess management required.
Configure infrastructure and dependencies in pyproject.toml, implement model logic via the Sprocket SDK, and push directly to GPU clusters. Bypass foundational orchestration setup completely.
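As a rough illustration of this push-to-deploy flow, here is a minimal sketch of what a container entry point could look like. The class name, method names, checkpoint path, and generate() call below are hypothetical placeholders rather than the actual Sprocket SDK interface; infrastructure settings such as GPU count and dependencies would be declared in pyproject.toml as described above.

```python
# Hypothetical sketch only: the real Sprocket SDK interface and names may differ.
# Infrastructure (GPU count, container image, dependencies) is declared in
# pyproject.toml; multi-GPU launch via the simplified torchrun interface is
# handled by the platform, so no manual subprocess management appears here.

import torch


class VideoGenerationHandler:
    """Illustrative load-once / predict-per-request handler shape."""

    def __init__(self, checkpoint: str = "/models/video-gen.pt") -> None:
        # Runs once per container start; weights stay resident on the GPU.
        self.device = torch.device("cuda")
        self.model = torch.load(checkpoint, map_location=self.device, weights_only=False)
        self.model.eval()

    @torch.inference_mode()
    def predict(self, prompt: str, num_frames: int = 48) -> torch.Tensor:
        # Runs per request; autoscaling, priority queuing, and the GPU and
        # latency metrics described above sit outside this handler.
        return self.model.generate(prompt, num_frames=num_frames)
```

The shape that matters is load-once at container start and serve-many per request; scaling, queuing, and observability are handled by the platform around the handler.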




Powered by leading research
With Dedicated Container Inference, teams benefit from our entire research pipeline, from automatic kernel optimizations to hands-on profiling and tuning for workload-specific performance gains.
Together AI vs. other providers
2.6x faster
Significant speedup of production video generation models through profiling and optimization, compared with the previous provider.
[Chart: throughput (TPS), Together Dedicated Container Inference vs. previous provider]

together.compile vs. alternatives
1.7x faster
together.compile automatically optimizes your kernel stack, delivering the fastest inference and near-instant server start, eliminating the tradeoff between compile-time optimization and cold-start speed.
[Chart: server runtime, inference (17 image resolutions x 3) vs. server start]
Deployment options
Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.
Real-time
A fully managed inference API that automatically scales with request volume.
Batch
Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% lower cost.
Dedicated Model Inference
An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine.
Dedicated Container Inference
Run inference with your own engine and model on fully managed, scalable infrastructure.
Production-grade security and data privacy
We take security and compliance seriously, with strict data privacy controls to keep your information protected. Your data and models remain fully under your ownership, safeguarded by robust security measures.
- NVIDIA Preferred Partner
- AICPA SOC 2 Type II


