On-demand dedicated endpoints: run inference with unmatched price-performance & control at scale
Today, we are excited to announce on-demand Dedicated Endpoints—now available with up to 43% lower pricing, delivering the best price-performance in dedicated GPU inference.
Scaling AI applications requires reliable, high-performance, and cost-efficient GPU compute; however, finding an offering that balances flexibility and affordability remains a challenge for many startups. Leading AI companies—like BlackBox AI and DuckDuckGo—have successfully brought their AI apps to production with Together Serverless Inference. Serverless deployments offer unmatched flexibility and ease of use, making them an ideal starting point for companies running AI apps in production.
As companies scale their generative AI applications in production, they often need guaranteed performance, more control and customizability over their deployment, and support for custom models. That’s where Together Dedicated Endpoints come in. Bridging the gap between the flexibility of serverless deployments and the reserved capacity of Together GPU Clusters, Dedicated Endpoints deliver optimal price-performance for scaling AI inference in production:
- The same high performance as serverless—but with single-tenancy, ensuring your traffic is never impacted by other users.
- The most price-competitive dedicated GPU inference available today—up to 50% cheaper than competing providers.
- Substantial cost savings at scale compared to serverless.
- Full control and customizability over the deployment hardware and configuration.
- Support for custom fine-tuned models.
- No minimum commitments, fully self-serve.
With this update, you can spin up on-demand Dedicated Endpoints for dozens of top open-source models—like DeepSeek-R1 and Llama 3.3 70B—or upload your custom fine-tuned model from Hugging Face, deploy it instantly, and start running inference. No upload or storage costs—just pay for the deployment itself.
Performance & control to scale in production
With Dedicated Endpoints, you get full control over a single-tenant GPU instance to deploy any model, without sharing compute resources. Using our web UI, API, or CLI, you can customize your deployment on some of the most powerful NVIDIA GPUs, including HGX H200 and HGX H100, ensuring optimal performance for your workload.
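As a rough illustration of what configuring a deployment programmatically can look like, the sketch below assembles a hypothetical request body for an endpoint-creation API call. The route, field names (`hardware`, `gpu_count`), and hardware identifier are illustrative assumptions, not the confirmed API shape—check the Together AI API reference for the real fields.

```python
import json

def build_endpoint_request(model: str, hardware: str, gpu_count: int) -> dict:
    """Assemble a hypothetical request body for creating a dedicated endpoint.

    Field names here are assumptions for illustration; consult the docs
    for the actual schema.
    """
    return {
        "model": model,
        "hardware": hardware,    # e.g. an H100 or H200 instance type
        "gpu_count": gpu_count,  # GPUs per replica
    }

payload = build_endpoint_request(
    "meta-llama/Llama-3.3-70B-Instruct-Turbo",  # one of the supported open-source models
    "nvidia_h100",                               # hypothetical hardware identifier
    2,
)
print(json.dumps(payload, indent=2))
```

The same configuration is available through the web UI and CLI, so a payload like this is only needed when automating deployments.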

Dedicated Endpoints take the high performance of serverless to a new level by guaranteeing consistency with no resource contention. On top of that, they allow you to enable or disable optimizations such as speculative decoding, which boosts text generation throughput by drafting tokens with a smaller model and verifying them with the target model, leading to faster, more responsive outputs.
“Together AI’s Dedicated Endpoints give us precise control over latency, throughput, and concurrency—allowing us to serve more than 10 million monthly active users with BLACKBOX AI autonomous coding agents. The flexibility of autoscaling, combined with exceptional engineering support, has been crucial in accelerating our growth.” – Robert Rizk, Co-Founder and CEO of BLACKBOX AI
Unmatched cost savings at scale
Because we have built a full end-to-end AI platform—including our own high-performance GPU infrastructure—we’re able to offer the most competitive pricing for dedicated deployments alongside one of the broadest selections of GPU architectures available today.
Now, we are reducing our Together Dedicated Endpoints prices by up to 43%, making them the most cost-effective dedicated GPU inference solution. These updates result in pricing up to 50% lower than other inference providers, delivering unmatched value for scalable, high-performance AI deployments.
Lower total cost than serverless at scale
Thanks to these price reductions and our optimized inference engine, Together AI Dedicated Endpoints often reduce overall costs compared to serverless once you reach a certain scale.
The table below compares pricing between a serverless deployment and a Dedicated Endpoint deployment with two H100 GPUs for some of our most popular serverless models.
These figures show that workloads with average volumes starting at 130,000 tokens/minute are likely to become more economical in a Dedicated Endpoint deployment (with two H100 GPUs) as compared to serverless.
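The crossover arithmetic above can be sketched in a few lines. All prices below are illustrative placeholders, not Together AI's actual rates (see the pricing page); with these assumed numbers, the break-even point lands in the same ballpark as the 130,000 tokens/minute figure cited above.

```python
# Back-of-the-envelope comparison: serverless (per-token) vs. a dedicated
# endpoint with two H100 GPUs (per-GPU-hour). Prices are HYPOTHETICAL.
SERVERLESS_PER_M_TOKENS = 0.88  # assumed $ per 1M tokens
DEDICATED_PER_GPU_HOUR = 3.50   # assumed $ per GPU-hour
GPUS = 2                        # two H100s, as in the comparison above

def hourly_cost_serverless(tokens_per_minute: float) -> float:
    """Serverless cost per hour at a given sustained token volume."""
    tokens_per_hour = tokens_per_minute * 60
    return tokens_per_hour / 1_000_000 * SERVERLESS_PER_M_TOKENS

def hourly_cost_dedicated() -> float:
    """Flat hourly cost of the 2-GPU dedicated endpoint."""
    return DEDICATED_PER_GPU_HOUR * GPUS

def crossover_tokens_per_minute() -> float:
    """Volume at which the dedicated endpoint matches serverless cost."""
    return hourly_cost_dedicated() / SERVERLESS_PER_M_TOKENS * 1_000_000 / 60

print(round(crossover_tokens_per_minute()))  # ~132,576 tokens/minute with these rates
```

Above that sustained volume, the flat hourly rate of the dedicated deployment undercuts per-token serverless billing.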
With Dedicated Endpoints, you get the best of both worlds—high performance, flexible scaling, full customizability, and no upfront commitments—all at industry-leading pricing. Read our docs to configure your first dedicated endpoint in seconds and contact us for reserved GPU capacity.
Handle usage spikes seamlessly with scaling
Unlike serverless, where compute resources are shared, Dedicated Endpoints provide isolated, single-tenancy compute, ensuring that your resources remain dedicated exclusively to your workloads. Additionally, they give you full control over configuring vertical and horizontal scaling options to handle any level of demand.
Scale vertically with more GPUs
If you need more compute power, you can scale your deployment vertically by increasing the GPU count. For example, you can adjust the configuration to deploy with 2, 4, or 8 GPUs per replica.
Scale horizontally with replicas
To ensure your endpoint has the capacity to handle peak workloads, you can set automatic scaling boundaries by defining a minimum and maximum replica count. When traffic spikes beyond your base capacity, additional replicas spin up on demand—ensuring consistent performance with zero manual intervention. You only pay for the extra replicas while they’re running, keeping costs optimized.
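To make the billing model concrete, here is a small sketch of what a traffic spike costs under this scheme, assuming extra replicas are billed only while they run. The hourly rate is a hypothetical placeholder, not an actual Together AI price.

```python
# Cost of a spike under replica autoscaling: base replicas run the whole
# period; burst replicas are billed only for the spike window.
REPLICA_PER_HOUR = 7.00  # HYPOTHETICAL $/hour for one replica

def spike_cost(base_replicas: int, peak_replicas: int,
               spike_hours: float, total_hours: float) -> float:
    """Total cost when autoscaling adds replicas only during a spike."""
    base = base_replicas * REPLICA_PER_HOUR * total_hours
    burst = (peak_replicas - base_replicas) * REPLICA_PER_HOUR * spike_hours
    return base + burst

# One base replica for a day, bursting to three replicas for a 2-hour spike:
print(spike_cost(base_replicas=1, peak_replicas=3,
                 spike_hours=2, total_hours=24))  # 196.0
```

Compare this with provisioning three replicas around the clock (3 × 7 × 24 = 504 with the same assumed rate): paying for burst capacity only while it runs is what keeps costs optimized.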
These scaling capabilities make Dedicated Endpoints the ideal choice for mission-critical AI applications that require:
- Reliable QPS with no risk of overload.
- Predictable availability, even under unpredictable traffic.
- Seamless handling of surges without performance dips.
Thanks to these scaling options, Together Dedicated Endpoints enable your AI models to scale seamlessly with demand without breaking performance or budget.
Pick the deployment that fits your needs
With this update, Together AI now offers the most comprehensive set of deployment options for inference, ensuring you get the right balance of flexibility, performance, and cost-efficiency.
If you’re unsure which deployment model best suits your needs, here’s a quick comparison:
Need help deciding? Read our docs to explore deployment options or contact us to discuss your specific requirements.
{{custom-cta-1}}
Deploy custom fine-tuned models in Dedicated Endpoints
With these improvements, our Dedicated Endpoints give developers a fast, flexible way to test, benchmark, and deploy models.
Today, we’re taking this a step further by introducing support for uploading fine-tuned versions of popular open-source models and running them on Dedicated Endpoints.
Our new API—available to all developers on Together AI paid tiers—makes deploying custom fine-tuned models easy:
- Upload supported models from Hugging Face with a simple API call (see the list of supported model architectures in our Docs).
- Deploy instantly on a Dedicated Endpoint for high-performance inference.
- Only pay for deployment—no upload fees, no storage costs.
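As a sketch of the upload step, the snippet below assembles a hypothetical request body pointing at a Hugging Face repository. The field names and the repo/model names are illustrative assumptions—the actual API shape is in the Docs linked above.

```python
import json

def build_upload_request(hf_repo: str, model_name: str) -> dict:
    """Hypothetical body for a model-upload call referencing a HF repo.

    Fields are assumptions for illustration; see the Docs for the real schema.
    """
    return {
        "source": f"huggingface/{hf_repo}",  # where to pull the weights from
        "name": model_name,                  # name to deploy the model under
    }

req = build_upload_request(
    "your-org/llama-3.3-70b-ft",  # hypothetical fine-tuned repo
    "llama-3.3-70b-ft",
)
print(json.dumps(req, indent=2))
```

Once the upload completes, the model can be deployed on a Dedicated Endpoint like any other supported model.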
Run the models you need, how you need them, with the best price-performance ratio for dedicated GPU inference.
Deploy a fine-tuned model to Together AI
Check out our Docs to upload a custom fine-tuned model from Hugging Face and test it on the Together platform.