Serverless Inference

The fastest way to run open-source models on demand

High-performance inference, powered by our in-house research. No infrastructure to manage, 
no long-term commitments.

Serverless inference on Together AI

Access all the top open-source models in one place.

Up to 2.75x faster inference

Powered by next-gen GPUs and research-driven optimizations, we deliver inference speeds around 2x faster than the next fastest provider.

Every modality, one API

Text, image, video, code, and voice. Access the full generative AI stack without stitching together multiple providers.

Built on cutting-edge systems research

Inference performance is driven by continuous optimization across kernels, scheduling, and runtime systems.

Build with leading models

Explore top-performing models across text, image, video, code, and voice.

New

Image

Nano Banana Pro (Gemini 3 Pro Image)

New

Chat

GLM-5

New

Chat

Kimi K2.5

New

Chat

gpt-oss-120B

New

Code

DeepSeek-V3.2-Exp

New

Chat

Ministral 3 8B Instruct 2512

New

Audio

MiniMax Speech 2.6 Turbo

New

Chat

LFM2 24B A2B

New

Code

Qwen3-Coder-Next

New

Image

Wan 2.6 Image

New

Image

GPT Image 1.5

Chat

Gemma 3 27B

Chat

Llama 4 Maverick

Chat

Qwen3 235B A22B Instruct 2507 FP8

Video

Google Veo 3.0

Image

FLUX.2 [pro]

Chat

NIM Llama 3.1 Nemotron 70B Instruct

New

Chat

Kimi K2 Instruct-0905

New

Video

Sora 2 Pro

New

Chat

Arcee AI Trinity Mini

Key capabilities, purpose built for AI natives

Scale from self-serve instant clusters to thousands of GPUs, all optimized with the Together Kernel Collection.

    • Adaptive speculative decoding

      Faster Outputs
      Lower latency
      Lossless quality

      Reduce end-to-end latency by predicting and validating multiple tokens per step instead of decoding strictly sequentially. AdapTive-LeArning Speculative System (ATLAS) learns from production traffic to further accelerate inference.

    • OpenAI-compatible API

      No code changes
      Full model control
      Lower cost

      Same API, better models. No code changes required. Drop in your API key and access hundreds of open-source models through the same interface you're already using.

    • Quantization without compromise

      No quality loss
      Faster inference
      Lower cost

      Run quantized models at full quality — our intelligent quantization reduces compute costs and improves speed without sacrificing output accuracy.

    • LoRA adapters on serverless

      BRING YOUR OWN ADAPTER
      HF & S3 SUPPORT
      PRIVATE & SECURE

      Upload LoRA adapters from Hugging Face or S3 to run instant inference on serverless models. Deploy private fine-tunes securely without provisioning dedicated infrastructure.

Research-optimized, best-in-class performance

We achieved up to 2.75x faster serverless inference for the most demanding LLMs, including GPT-OSS, Qwen, Kimi, and DeepSeek.

  • GPT-OSS-20B
  • Qwen3 235B 2507
  • Kimi K2 0905
  • DeepSeek V3.1
  • DeepSeek R1 0528
  • Output speed: gpt-oss-20B (low), by provider

    Providers compared: Together AI, Vexteer, Lightning, Databricks, Nebius base, Novita, Amazon, Cloudflare, Hyperbolic

    Together AI vs other providers

    2x faster

    We achieved nearly 2x faster serverless inference performance for gpt-oss-20B when compared with the next fastest provider.

    Learn more
  • Output speed: Qwen3 235B 2507, by provider

    Providers compared: Together AI (FP8), Amazon, Lightning, Databricks, Nebius base, Novita, Cloudflare, Hyperbolic

    Together AI vs other providers

    2.75x faster

    We achieved nearly 2.75x faster serverless inference performance for Qwen3 235B 2507 when compared with the next fastest provider.

    Learn more
  • Output speed: Kimi K2 0905, by provider

    Providers compared: Together AI, Fireworks, Baseten (FP4), Parasail, DeepInfra, Novita

    Together AI vs other providers

    65% faster

    We achieved over 65% faster serverless inference performance for Kimi-K2-0905 when compared with the next fastest provider.

    Learn more
  • Output speed: DeepSeek V3.1, by provider

    Providers compared: Together AI, Fireworks, Baseten (FP4), Vertex, Parasail (FP8), Lightning AI, Amazon, GMI (FP8), Novita, DeepInfra (FP4)

    Together AI vs other providers

    10% faster

    We achieved over 10% faster serverless inference performance for DeepSeek-V3.1 when compared with the next fastest provider.

    Learn more
  • Output speed: DeepSeek R1 0528, by provider

    Providers compared: Together AI, Nebius fast (FP4), Fireworks Fast, Vertex, Azure, Together AI (Throughput), Hyperbolic, DeepInfra, Novita, Parasail, Nebius

    Together AI vs other providers

    13% faster

    We achieved over 13% faster serverless inference performance for DeepSeek-R1-0528 when compared with the next fastest provider.

    Learn more

Deployment options

Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.

Serverless Inference

Real-time

A fully managed inference API that automatically scales with request volume.

Best for

Variable or unpredictable traffic

Rapid prototyping and iteration

Cost-sensitive or early-stage production workloads

Batch

Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% less cost.

Best for

Classifying large datasets

Offline summarization

Synthetic data generation

Dedicated Model Inference

An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine.

Best for

Predictable or steady traffic

Latency-sensitive applications

High-throughput production workloads

Dedicated Container Inference

Run inference with your own engine and model on fully-managed, scalable infrastructure.

Best for

Generative media models

Non-standard runtimes

Custom inference pipelines

Production-grade security and data privacy

We take security and compliance seriously, with strict data privacy controls to keep your information protected. Your data and models remain fully under your ownership, safeguarded by robust security measures.

Learn More

  • NVIDIA preferred partner
  • AICPA SOC 2 Type II

Customers running inference in production

  • ~30%

    Cost savings

"Together has helped us deploy VyUI, our state-of-the-art computer AI model. We had multiple in-depth meetings where we brainstormed how we could satisfy our model's custom technical requirements while still leveraging Together's infrastructure for efficient, load-balanced inference."

Luca Weihs

Co-founder, Vercept

"Together AI offers optimized performance at scale, and at a lower cost than closed-source providers – all while maintaining strict privacy standards."

Vineet Khosla

CTO, The Washington Post

  • ~33%

    Cost savings

  • 2x

    Latency reduction

"We’ve been thoroughly impressed with Together. They delivered a 2x reduction in latency and cut our costs by approximately a third."

Caiming Xiong

VP, Salesforce AI Research