Introducing Dedicated Container Inference: Delivering 2.6x faster inference for custom AI models
Summary
Dedicated Container Inference lets teams deploy custom generative media models — like video generation, avatar synthesis, and image processing — with production-grade orchestration they don't have to build themselves. Bring your Docker container; Together handles autoscaling, queuing, traffic isolation, and monitoring. For teams already training on Together's GPU Cloud, going from training to production requires zero artifact transfers. Customers like Creatify and Hedra have seen 1.4x–2.6x inference speedups, driven by both the platform architecture and hands-on optimization from Together's research team.
Since Together’s inception, we have powered large-scale LLM inference. We understand that space deeply and have built products optimized for stateless requests, token-level latency, and highly tuned serving paths.
Dedicated Container Inference is built for a different class of workloads.
Over the last year, we have worked closely with teams deploying custom, non-LLM models into production. These go far beyond text-in, text-out APIs: video generation pipelines, avatar synthesis, large-scale image processing, and custom audio/media models with real business constraints.
What these teams consistently needed was not just GPUs or containers. They needed a way to run custom inference code in production without building their own job orchestration layer. Autoscaling, queuing, traffic isolation, retries, and monitoring all mattered, but none of them wanted to reimplement that stack in-house.
Dedicated Container Inference is our answer to that gap: bringing production-grade orchestration for custom models to the AI Native Cloud.
You bring your container and inference logic. We handle deployment, autoscaling, queuing, and monitoring at the job level, built specifically for GPU-intensive workloads.
How is this different?
Most inference platforms optimize around a single abstraction: either a stateless endpoint or a large batch job. That works until you have real-time and batch traffic, customer tiers, and sudden demand spikes all at the same time.
Dedicated Container Inference is built around job orchestration for your container, which enables:
- Multiple independent queues instead of a single FIFO stream
- Policy-driven traffic control rather than per-request priority
- Isolation between batch, real-time, and untrusted traffic
- Predictable behavior during spikes without over-provisioning
The difference, though, is not just at inference time. Together is an end-to-end training-to-inference platform. Models trained on Together’s GPU Cloud can be deployed directly as Dedicated Containers without any artifact transfers or additional work.
For teams building custom models, this tight loop reduces operational overhead and makes it easier to move from training to production without introducing new failure modes.
Architecture at a glance
Dedicated Container Inference is built on a container-based deployment framework where jobs and queues are first-class concepts. Instead of forcing inference into a single request-response shape, we treat your container as the unit of execution and manage everything around it.

At a high level:
You package your model as a Docker container
The container includes your runtime, dependencies, and inference code. You decide how inference runs and what libraries you use.
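For illustration, here is a minimal sketch of what an inference entrypoint inside such a container might look like, assuming an HTTP-style handler built with FastAPI. The route, request schema, and `PlaceholderModel` are stand-ins for your own code, not a prescribed interface:

```python
# A minimal sketch of an entrypoint a container might expose.
# PlaceholderModel stands in for your real pipeline; the route and
# request schema are illustrative, not part of the Together API.
from fastapi import FastAPI
from pydantic import BaseModel


class PlaceholderModel:
    """Stands in for your real pipeline (e.g., a diffusers or custom torch model)."""

    def generate(self, prompt: str, num_frames: int) -> str:
        return f"generated {num_frames} frames for: {prompt}"


class GenerateRequest(BaseModel):
    prompt: str
    num_frames: int = 48  # example parameter for a video model


app = FastAPI()
model = PlaceholderModel()  # loaded once per replica, reused across requests


@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    # Your inference logic runs exactly as you wrote it; the platform manages everything around it.
    return {"output": model.generate(req.prompt, req.num_frames)}
```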
We deploy and manage that container on GPU infrastructure
Together provisions GPUs, launches replicas, and handles networking, health checks, and monitoring. You do not manage clusters or nodes directly. For large models that require multiple GPUs, we provide built-in support for distributed inference via torchrun.
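As a rough sketch, a torchrun-launched entrypoint typically uses the standard PyTorch distributed setup shown below; this is not a Together-specific API, and the launch command and sharding strategy are up to your own code:

```python
# Minimal sketch of a torchrun-launched multi-GPU inference entrypoint.
# torchrun (e.g., `torchrun --nproc_per_node=4 infer.py`) sets RANK, WORLD_SIZE,
# and LOCAL_RANK; how the model is sharded across ranks is model-specific.
import os

import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group(backend="nccl")  # rank/world size come from torchrun env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Load your shard of the model here (tensor/pipeline parallelism is up to you).
    if dist.get_rank() == 0:
        print(f"running distributed inference across {dist.get_world_size()} GPUs")

    dist.barrier()  # wait for all ranks before shutting down
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```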
Volume mounts for model weights
Rebuilding a 50GB container every time you update weights is slow and expensive. With volume mounts, you upload weights once and attach them at deploy time.
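A small sketch of how a container might read weights from a mounted volume instead of baking them into the image; the mount path and environment variable are assumptions chosen for illustration:

```python
# Read checkpoints from a mounted volume so the image stays small and stable.
# The default path and env var name below are hypothetical; the real path is
# whatever you configure when you attach the volume at deploy time.
import os
from pathlib import Path

import torch

WEIGHTS_DIR = Path(os.environ.get("WEIGHTS_DIR", "/mnt/weights"))


def load_state_dict(filename: str) -> dict:
    """Load a checkpoint from the mounted volume rather than the container image."""
    checkpoint = WEIGHTS_DIR / filename
    return torch.load(checkpoint, map_location="cpu")
```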
Inference runs as jobs
Requests are queued and executed by workers pulled from your deployment. This supports long-running jobs, batch workloads, and mixed traffic patterns.
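To make the job model concrete, here is an illustrative worker loop using an in-process queue as a stand-in. In production the platform handles queuing and dispatch for you, so this only sketches the execution pattern:

```python
# Illustrative job-based execution: requests sit in a queue and workers pull them,
# rather than handling a synchronous request/response. The in-process queue is a
# stand-in for the platform's managed queues.
import queue
import time
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    prompt: str


def run_inference(job: Job) -> str:
    time.sleep(0.1)  # stand-in for a long-running generation step
    return f"result for {job.job_id}"


def worker_loop(jobs: "queue.Queue[Job]") -> None:
    while True:
        try:
            job = jobs.get(timeout=5)  # pull the next job from the queue
        except queue.Empty:
            break
        print(run_inference(job))
        jobs.task_done()


if __name__ == "__main__":
    q: "queue.Queue[Job]" = queue.Queue()
    q.put(Job("job-1", "a cat surfing"))
    worker_loop(q)
```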
Autoscaling driven by queue depth or metric of choice
Scale capacity up or down based on utilization, queue length, weighted queue priority, job features such as video length, or a target wait time.
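As a toy illustration of queue-depth-driven scaling (the platform's actual policies also account for weighted priorities, job features, and target wait times; the numbers here are made up):

```python
# Toy queue-depth autoscaling rule: size the replica count from the backlog
# and a per-replica throughput target, clamped to configured bounds.
import math


def desired_replicas(queue_depth: int, jobs_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 16) -> int:
    wanted = math.ceil(queue_depth / max(jobs_per_replica, 1))
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(queue_depth=37, jobs_per_replica=4))  # -> 10
```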
Traffic policies are explicit
You can define multiple queues and control priority by customer tier, use case, or SLA. Batch workloads do not interfere with real-time requests, and paid users are protected during spikes.
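For intuition, here is a simplified sketch of weighted queue selection, with made-up queue names and weights; on the platform these policies are configured rather than hand-written:

```python
# Illustrative traffic policy: paid real-time traffic gets the largest share,
# batch traffic drains at a much smaller weight so it never starves real-time work.
# Queue names and weights are invented for the example.
import random
from collections import deque

queues = {
    "realtime-paid": deque(["job-a"]),
    "realtime-free": deque(["job-b", "job-c"]),
    "batch": deque(["job-d", "job-e", "job-f"]),
}
weights = {"realtime-paid": 6, "realtime-free": 3, "batch": 1}


def next_job():
    candidates = {name: q for name, q in queues.items() if q}
    if not candidates:
        return None
    names = list(candidates)
    chosen = random.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return candidates[chosen].popleft()


print(next_job())
```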
Observability is built in
Metrics, logs, and job state are available out of the box, so you can monitor your deployment without building extra tooling.
Performance gains for production workloads
For generative media workloads, small improvements in inference speed compound quickly into large cost and latency gains.
With Dedicated Container Inference, teams benefit from our research pipeline, ranging from automatic kernel optimizations to hands-on profiling and tuning for workload-specific performance improvements.
"Infrastructure costs can kill an AI company as they scale. Together's Dedicated Container Inference solved two critical problems for us: handling unpredictable viral traffic without over-provisioning, and taking our already-competitive model performance to the next level.
Their research team achieved significant lossless speedups that directly improved our unit economics—without sacrificing quality. They didn't just provide GPUs; they partnered with us to make our inference more efficient at scale. That level of technical partnership, combined with production-grade infrastructure, let us focus on building products instead of managing clusters."
— Ledell Wu, Co-Founder & Chief Research Scientist, Creatify

Across production deployments, we have seen:
- Large reductions in real-time factor for video generation models, moving from tens of seconds per second of generated output down to the low double digits
- Meaningful speedups on avatar and video synthesis models through better batching, scheduling, and multi-GPU execution
- End-to-end improvements that turn previously uneconomical models into viable production services
These gains come from a combination of platform capabilities and hands-on optimization work. The result is lower latency, lower cost per output second, and more predictable scaling behavior under load.
“Together AI’s researchers partnered with us to optimize our model’s inference performance. It was hands-on work that made our model meaningfully faster.
Their infrastructure absorbs viral traffic without breaking a sweat. During major surges, Dedicated Container Inference scales seamlessly while maintaining performance.
And because we trained on Together’s Accelerated Compute, deploying to production was frictionless. One platform, zero artifact transfers, no deployment headaches.”
— Terrence Wang, Founding ML Engineer, Hedra
Getting started
If you are deploying a custom model and need production-grade orchestration, Dedicated Container Inference is designed for that use case. You keep full control over your container and inference logic. We manage deployment, scaling, queuing, and monitoring.