Serverless Inference

The fastest way to run open-source models on demand

High-performance inference, powered by our in-house research. No infrastructure to manage, 
no long-term commitments.

Serverless inference on Together AI

Access all the top open-source models in one place.

Up to 2.75x faster inference

Powered by next-gen GPUs and research-driven optimizations, we deliver inference speeds around 2x faster than the next fastest provider.

Every modality, one API

Text, image, video, code, and voice. Access the full generative AI stack without stitching together multiple providers.

Built on cutting-edge systems research

Inference performance is driven by continuous optimization across kernels, scheduling, and runtime systems.

Build with leading models

Explore top-performing models across text, image, video, code, and voice.

New

Image

Nano Banana Pro (Gemini 3 Pro Image)

New

Chat

GLM-5

New

Chat

Kimi K2.5

New

Chat

gpt-oss-120B

New

Code

DeepSeek-V3.2-Exp

New

Chat

Ministral 3 8B Instruct 2512

New

Audio

MiniMax Speech 2.6 Turbo

New

Chat

LFM2 24B A2B

New

Code

Qwen3-Coder-Next

New

Image

Wan 2.6 Image

New

Image

GPT Image 1.5

Chat

Gemma 3 27B

Chat

Llama 4 Maverick

Chat

Qwen3 235B A22B Instruct 2507 FP8

Video

Google Veo 3.0

Image

FLUX.2 [pro]

Chat

NIM Llama 3.1 Nemotron 70B Instruct

New

Chat

Kimi K2 Instruct-0905

New

Video

Sora 2 Pro

New

Chat

Arcee AI Trinity Mini

Key capabilities, purpose built for AI natives

Scale from self-serve instant clusters to thousands of GPUs, all optimized with the Together Kernel Collection.

    • Adaptive speculative decoding

      Faster Outputs
      Lower latency
      Lossless quality

      Reduce end-to-end latency by predicting and validating multiple tokens per step instead of decoding strictly sequentially. AdapTive-LeArning Speculative System (ATLAS) learns from production traffic to further accelerate inference.

    • OpenAI-compatible API

      No code changes
      Full model control
      Lower cost

      Same API, better models. No code changes required. Drop in your API key and access hundreds of open-source models through the same interface you're already using.

    • Quantization without compromise

      No quality loss
      Faster inference
      Lower cost

      Run quantized models at full quality — our intelligent quantization reduces compute costs and improves speed without sacrificing output accuracy.

    • LoRA adapters on serverless

      BRING YOUR OWN ADAPTER
      HF & S3 SUPPORT
      PRIVATE & SECURE

      Upload LoRA adapters from Hugging Face or S3 to run instant inference on serverless models. Deploy private fine-tunes securely without provisioning dedicated infrastructure.

Research-optimized, best-in-class performance

We achieved up to 2.75x faster serverless inference for the most demanding LLMs, including GPT-OSS, Qwen, Kimi, and DeepSeek.

  • GPT-OSS-20B
  • Qwen3 235B 2507
  • Kimi K2 0905
  • DeepSeek V3.1
  • DeepSeek R1 0528
  • Output speed: gpt-oss-20B (low), by provider

    Providers compared: Together AI, Vexteer, Lightning, Databricks, Nebius base, Novita, Amazon, Cloudflare, Hyperbolic

    Together AI vs other providers

    2x faster

    We achieved nearly 2x faster serverless inference performance for gpt-oss-20B when compared with the next fastest provider.

    Learn more
  • Output speed: Qwen3 235B 2507, by provider

    Providers compared: Together AI (FP8), Amazon, Lightning, Databricks, Nebius base, Novita, Cloudflare, Hyperbolic

    Together AI vs other providers

    2.75x faster

    We achieved nearly 2.75x faster serverless inference performance for Qwen3 235B 2507 when compared with the next fastest provider.

    Learn more
  • Output speed: Kimi K2 0905, by provider

    Providers compared: Together AI, Fireworks, Baseten (FP4), Parasail, DeepInfra, Novita

    Together AI vs other providers

    65% faster

    We achieved over 65% faster serverless inference performance for Kimi-K2-0905 when compared with the next fastest provider.

    Learn more
  • Output speed: DeepSeek V3.1, by provider

    Providers compared: Together AI, Fireworks, Baseten (FP4), Vertex, Parasail (FP8), Lightning AI, Amazon, GMI (FP8), Novita, DeepInfra (FP4)

    Together AI vs other providers

    10% faster

    We achieved over 10% faster serverless inference performance for DeepSeek-V3.1 when compared with the next fastest provider.

    Learn more
  • Output speed: DeepSeek R1 0528, by provider

    Providers compared: Together AI, Nebius fast (FP4), Fireworks Fast, Vertex, Azure, Together AI (Throughput), Hyperbolic, DeepInfra, Novita, Parasail, Nebius

    Together AI vs other providers

    13% faster

    We achieved over 13% faster serverless inference performance for DeepSeek-R1-0528 when compared with the next fastest provider.

    Learn more

Deployment options

Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.

Serverless Inference

Real-time

A fully managed inference API that automatically scales with request volume.

Best for

Variable or unpredictable traffic

Rapid prototyping and iteration

Cost-sensitive or early-stage production workloads

Batch

Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% less cost.

Best for

Classifying large datasets

Offline summarization

Synthetic data generation

Dedicated Model Inference

An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine.

Best for

Predictable or steady traffic

Latency-sensitive applications

High-throughput production workloads

Dedicated Container Inference

Run inference with your own engine and model on fully-managed, scalable infrastructure.

Best for

Generative media models

Non-standard runtimes

Custom inference pipelines

Production-grade security and data privacy

We take security and compliance seriously, with strict data privacy controls to keep your information protected. Your data and models remain fully under your ownership, safeguarded by robust security measures.

Learn More

  • NVIDIA preferred partner
  • AICPA SOC 2 Type II

Customers running inference in production

  • ~30%

    Cost savings

"Together has helped us deploy VyUI, our state-of-the-art computer AI model. We had multiple in-depth meetings where we brainstormed how we could satisfy our model's custom technical requirements while still leveraging Together's infrastructure for efficient, load-balanced inference."

Luca Weihs

Co-founder, Vercept

"Together AI offers optimized performance at scale, and at a lower cost than closed-source providers – all while maintaining strict privacy standards."

Vineet Khosla

CTO, The Washington Post

  • ~33%

    Cost savings

  • 2x

    Latency reduction

"We’ve been thoroughly impressed with Together. They delivered a 2x reduction in latency and cut our costs by approximately a third."

Caiming Xiong

VP, Salesforce AI Research