Dedicated Container Inference

GPU infrastructure purpose-built for generative media workloads. Deploy video, audio, and avatar generation models on the AI Native Cloud.

Production infrastructure engineered for generative media workloads

Dedicated Container Inference handles the complexity of deploying video, audio, and avatar generation models at scale.

  • Predictable, efficient pricing

    Best unit economics for generative media workloads. Only pay for capacity you use.

  • Research-backed speed

    Hands-on partnership to profile and optimize your models. Achieve a 60% speedup on production video generation workloads.

  • Made for massive surges

    Proven elastic autoscaling during viral moments. Priority-based queuing ensures paying customers never wait, even when free trial traffic spikes.

  • Job-level monitoring

    Real-time visibility into inference jobs, GPU utilization, queue depth, and latency metrics. Full observability for production debugging.

  • Multi-cluster scaling

    Quick autoscaling across clusters handles 10x traffic surges with zero failures.

  • Multi-GPU orchestration

    Deploy video generation models across multiple GPUs through a simplified torchrun interface, with no manual subprocess management required (a sketch of the boilerplate this replaces follows below).
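
For context, the sketch below shows the kind of manual torchrun and torch.distributed boilerplate this replaces; the script name, launch flags, and worker logic are illustrative and not part of the platform.

# Illustrative only: the manual multi-GPU boilerplate that the simplified
# torchrun interface abstracts away. A script like this would be launched with
#   torchrun --nproc_per_node=4 generate.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets LOCAL_RANK for each spawned worker process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Each rank would load its model shard and run its part of the generation job here

    dist.destroy_process_group()


if __name__ == "__main__":
    main()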

Deploy custom models with simple, production-ready primitives

Use intuitive primitives to containerize your model, configure autoscaling, and deploy to dedicated GPU infrastructure — without building orchestration from scratch.

[project]
name = "video-inference"
version = "0.1.0"
dependencies = [
  "torch",
  "sprocket",
]

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

import sprocket


class VideoSprocket(sprocket.Sprocket):
    def setup(self) -> None:
        # Load the model once at container startup
        self.model = load_your_model()

    def predict(self, args: dict) -> dict:
        input_path = args["input_path"]
        # Run inference; your model should write the generated video to output.mp4
        self.model.generate(input_path)
        # Return the output file so it is uploaded when the job completes
        return {"url": sprocket.FileOutput("output.mp4")}


if __name__ == "__main__":
    sprocket.run(VideoSprocket(), "your-model-name")
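
Before running jig deploy, it can help to smoke-test the class locally by calling its methods directly. The snippet below is a minimal sketch that assumes load_your_model is implemented, the base class needs no constructor arguments, and a test clip exists on disk.

# Local smoke test of VideoSprocket (illustrative; assumes load_your_model is
# implemented, the base class takes no constructor arguments, and test_clip.mp4 exists)
sp = VideoSprocket()
sp.setup()                                            # loads the model once
print(sp.predict({"input_path": "test_clip.mp4"}))    # expect a dict with an output URL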

# Build the container and deploy to production
jig deploy

# Your model is now live with: 
# - Priority-based queuing 
# - Multi-cluster orchestration

Why choose Dedicated Container Inference

Build and scale generative media workloads with infrastructure engineered for performance, reliability, and profitability.

Pick the deployment that fits your needs

Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.

Serverless Inference

Deploy models instantly with a fully managed API—no infrastructure setup required—and pay per token.

Real-time

A fully managed inference API that automatically scales with request volume; a minimal example call is sketched after the list below.

Best for:

  • Variable or unpredictable traffic
  • Rapid prototyping and iteration
  • Cost-sensitive or early-stage production workloads
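
For comparison, a serverless request takes a few lines with the Together Python SDK. The sketch below assumes TOGETHER_API_KEY is set in the environment, and the model name is only an example.

# Minimal serverless request via the Together Python SDK (pip install together).
# Assumes TOGETHER_API_KEY is set; the model name is an example, not a recommendation.
from together import Together

client = Together()
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Give one sentence on serverless vs. dedicated inference."}],
)
print(response.choices[0].message.content)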

Batch

Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% lower cost; a sketch of how requests are typically packaged follows the list below.

Best for:

  • Classifying large datasets
  • Offline summarization
  • Synthetic data generation
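
As a rough illustration of how batch jobs are usually packaged, the sketch below writes requests to a JSONL file. The per-line schema here is an assumption, so check the Batch Inference docs for the exact format and upload steps.

# Rough sketch: packaging prompts as a JSONL file for asynchronous batch processing.
# The per-line request schema is an assumption; see the Batch Inference docs for
# the exact format and how to upload the file.
import json

documents = ["first document text ...", "second document text ..."]

with open("batch_requests.jsonl", "w") as f:
    for i, text in enumerate(documents):
        request = {
            "custom_id": f"doc-{i}",           # lets you match results back to inputs
            "body": {
                "model": "your-model",          # illustrative model name
                "messages": [{"role": "user", "content": f"Summarize: {text}"}],
            },
        }
        f.write(json.dumps(request) + "\n")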

Dedicated Inference

Reserved compute resources for your inference workloads with predictable, best-in-class performance and cost.

Dedicated Endpoints

An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine.

Best for:

  • Predictable or steady traffic
  • Latency-sensitive applications
  • High-throughput production workloads

Dedicated Container Inference

Run inference with your own engine and model on fully-managed, scalable infrastructure.

Best for:

  • Generative media models
  • Non-standard runtimes
  • Custom inference pipelines

Our content & guides

Learn more about Dedicated Container Inference and start building by following our guides.