Dedicated Container Inference

GPU infrastructure purpose-built for generative media workloads

Deploy video, audio, and avatar generation models on the AI Native Cloud.

Why Dedicated Container Inference with Together AI?

Designed for production workloads that need consistent performance and operational control.

Predictable pricing, leading unit economics

Leading performance lowers your effective GPU cost. Rapid autoscaling ensures you only pay for the capacity you use. Train on Together and deploy to dedicated containers without artifact transfer fees.

Research-backed speed

Hands-on partnership to profile and optimize your models. Achieve up to a 2.6x speedup for production video generation workloads.

Made for massive surges

Proven elastic autoscaling during viral moments supports 10x+ traffic surges. Priority-based queuing ensures paying customers never wait, even when free-trial traffic spikes.

Key capabilities, purpose-built for AI natives

From hardware to inference stack, every capability is optimized to get more out of every request.

    • Job-level monitoring

      Real-time job visibility
      Latency metrics
      Observability

      Monitor inference jobs, GPU utilization, queue depth, and latency metrics in real time. Full observability for production debugging.

    • Multi-cluster scaling

      10x traffic handling
      Zero-failure scaling
      Priority queuing

      Autoscale rapidly to handle 10x traffic surges with zero failures. Keep critical workloads running via priority-based queuing.

    • Multi-GPU orchestration

      No subprocess management
      Support for massive models

      Deploy video generation models across multiple GPUs with a simplified torchrun interface. No manual subprocess management required (see the torchrun sketch after this list).

    • Simple, production-ready primitives

      Set up via SDK
      Custom config
      One line to deploy

      Configure infrastructure and dependencies in pyproject.toml, implement model logic via the Sprocket SDK, and push directly to GPU clusters. Bypass foundational orchestration setup completely (deployment sketch below).
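
For illustration, the end-to-end workflow can look roughly like the sketch below. The [tool.sprocket] table, its field names, and the Endpoint base class are assumptions made for this sketch, not the documented Sprocket SDK surface.

    # pyproject.toml (hypothetical table and fields, for illustration only)
    [tool.sprocket]
    gpu = "h100"
    gpu_count = 4
    min_replicas = 0
    max_replicas = 16

    # model.py: a minimal handler skeleton against an assumed SDK interface
    from sprocket import Endpoint  # assumed import path

    class VideoModel(Endpoint):
        def setup(self) -> None:
            # Runs once per replica: load model weights onto the GPUs.
            ...

        def infer(self, prompt: str) -> bytes:
            # Runs per request: generate and return the encoded video.
            ...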
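
For context on the multi-GPU orchestration point, the sketch below shows the kind of torch.distributed setup that a raw torchrun launch normally involves, and which the simplified interface manages for you. This is a generic PyTorch sketch, not Together's internal code:

    # Normally launched as: torchrun --nproc-per-node=4 serve.py
    # torchrun spawns one worker process per GPU and sets the rendezvous env vars.
    import os

    import torch
    import torch.distributed as dist

    def main() -> None:
        dist.init_process_group(backend="nccl")     # join the group torchrun configured
        local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each worker
        torch.cuda.set_device(local_rank)
        # Shard the video model across ranks here (tensor / sequence parallelism),
        # then serve requests.
        ...
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()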

Powered by leading research

With Dedicated Container Inference, teams benefit from our research pipeline, from automatic kernel optimizations to hands-on profiling and tuning for workload-specific performance improvements.

  • Optimized Inference: 2.6x faster

    [Chart: Together AI vs. other providers; throughput (TPS) for Together Dedicated Container Inference vs. the previous provider]

    Significant speedup of production video generation models through profiling and optimization, compared with the previous provider.

    Learn more

  • together.compile: 1.7x faster

    [Chart: together.compile vs. alternatives; server runtime for inference (17 image resolutions × 3) and server start]

    together.compile automatically optimizes your kernel stack, delivering the fastest inference and near-instant server start, eliminating the tradeoff between compile-time optimization and cold-start speed.

    Learn more

Deployment options

Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.

Serverless Inference

Real-time

A fully managed inference API that automatically scales with request volume.

Best for

Variable or unpredictable traffic

Rapid prototyping and iteration

Cost-sensitive or early-stage production workloads
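
For example, a single request against the serverless API using the together Python SDK; the model id and parameters below are illustrative:

    from together import Together

    client = Together()  # reads TOGETHER_API_KEY from the environment

    # One serverless image-generation request; capacity scales with request volume.
    response = client.images.generate(
        model="black-forest-labs/FLUX.1-schnell",  # example model id
        prompt="a watercolor city skyline at dusk",
        steps=4,
    )
    print(response.data[0])  # generated image, returned as a URL or base64 payload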

Batch

Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% lower cost.

Best for

Classifying large datasets

Offline summarization

Synthetic data generation

Dedicated Model Inference

An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine.

Best for

Predictable or steady traffic

Latency-sensitive applications

High-throughput production workloads

Dedicated Container Inference

Run inference with your own engine and model on fully managed, scalable infrastructure.

Best for

Generative media models

Non-standard runtimes

Custom inference pipelines

Production-grade security and data privacy

We take security and compliance seriously, with strict data privacy controls to keep your information protected. Your data and models remain fully under your ownership, safeguarded by robust security measures.

Learn More

  • NVIDIA Preferred Partner
  • AICPA SOC 2 Type II

Customers running inference in production

    “Together AI’s infrastructure has the capacity to soak up our viral moments without breaking a sweat. During major traffic surges, Dedicated Container Inference scales seamlessly while maintaining performance. And because we trained on Together’s Accelerated Compute, deploying to production was frictionless—one platform, zero artifact transfers, no deployment headaches.”

    Terrance Wang

    Founding ML Engineer, Hedra

      "Infrastructure costs can kill an AI company as they scale. Together's Dedicated Container Inference handles unpredictable viral traffic without over-provisioning and unlocked significant speedups that directly improved our unit economics—without sacrificing quality. This let us focus on building products instead of managing clusters."

      Ledell Wu

      Co-Founder & Chief Research Scientist, Creatify