Batch Inference

Process massive workloads asynchronously

Scale to 30 billion tokens per model with any serverless model or private deployment.

Why Batch Inference with Together AI?

Lower cost, higher limits, and predictable processing

Up to 50% cost savings

Run batch jobs at up to half the cost of our real-time API for most serverless models. Process millions of requests economically without sacrificing quality or speed.

30B enqueued tokens per model

Run massive batch jobs that scale to 30 billion enqueued tokens per model per user. Need more? We'll customize limits for your specific use case.

<24h processing time SLA

Jobs consistently finish well under 24 hours — often within just hours. Submit and forget while we handle the scale.

Key capabilities, purpose-built for AI natives

Run massive asynchronous inference jobs against any serverless model or dedicated deployment.

    • Universal model access

      Any serverless model
      Dedicated deployments

      Run batch jobs across any serverless model or private deployment — no limitations on model choice.

    • Up and running in minutes

      Launch in three steps
      No DevOps required
      No orchestration

      Launch massive inference jobs simply by uploading a JSONL file. Start processing batches with just a few clicks. No orchestration or monitoring setup required (see the sketch after this list).

    • 50% off many top models

      DeepSeek, Llama, Qwen & more
      No minimum volume

      Run batch jobs at up to half the cost of our real-time API for most serverless models. Process millions of requests without sacrificing quality or speed.
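As a concrete illustration of the JSONL workflow described above, here is a minimal sketch of submitting a batch job with the Together Python SDK. The per-line request shape (`custom_id` plus a `body` holding `model` and `messages`), the `purpose="batch-api"` value, and the `files.upload` / `batches.create_batch` method names are assumptions patterned on OpenAI-compatible batch APIs; verify them against the current Together documentation. The model ID is illustrative.

```python
# Sketch: build a batch_input.jsonl file and enqueue a batch job.
# Method and field names are assumptions patterned on OpenAI-compatible
# batch APIs; verify against the current Together docs before use.
import json

from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# One request per line; custom_id lets you match outputs back to inputs.
requests = [
    {
        "custom_id": f"doc-{i}",
        "body": {
            "model": "deepseek-ai/DeepSeek-V3.1",  # illustrative model ID
            "messages": [{"role": "user", "content": f"Summarize document {i}."}],
            "max_tokens": 256,
        },
    }
    for i in range(3)
]
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload the file, then enqueue the batch against the chat completions endpoint.
uploaded = client.files.upload(file="batch_input.jsonl", purpose="batch-api")
batch = client.batches.create_batch(uploaded.id, endpoint="/v1/chat/completions")
print("Submitted batch:", batch.id)
```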

Research-optimized, best-in-class performance

We achieved up to 2x faster serverless inference for the most demanding LLMs, including GPT-OSS, Qwen, Kimi, and DeepSeek.

  • Output speed: gpt-oss-20B (low)

    [Chart comparing output speed across providers: Together AI, Vexteer, Lightning, Databricks, Nebius Base, Novita, Amazon, Cloudflare, Hyperbolic]

    Together AI vs. other providers: 2x faster

    We achieved nearly 2x faster serverless inference performance for gpt-oss-20B when compared with the next fastest provider.

    Learn more
  • Output speed: Qwen3 235B 2507

    [Chart comparing output speed across providers: Together AI (FP8), Amazon, Lightning, Databricks, Nebius Base, Novita, Cloudflare, Hyperbolic]

    Together AI vs. other providers: 2.75x faster

    We achieved nearly 2.75x faster serverless inference performance for Qwen3 235B 2507 when compared with the next fastest provider.

    Learn more
  • Output speed: Kimi K2 0905

    [Chart comparing output speed across providers: Together AI, Fireworks, Baseten (FP4), Parasail, Deepinfra, Novita]

    Together AI vs. other providers: 65% faster

    We achieved over 65% faster serverless inference performance for Kimi-K2-0905 when compared with the next fastest provider.

    Learn more
  • Output speed: DeepSeek V3.1

    [Chart comparing output speed across providers: Together AI, Fireworks, Baseten (FP4), Vertex, Parasail (FP8), Lightning AI, Amazon, GMI (FP8), Novita, Deepinfra (FP4)]

    Together AI vs. other providers: 10% faster

    We achieved over 10% faster serverless inference performance for DeepSeek-V3.1 when compared with the next fastest provider.

    Learn more
  • Output speed: DeepSeek R1 0528

    [Chart comparing output speed across providers: Together AI, Nebius Fast (FP4), Fireworks Fast, Vertex, Azure, Together AI (Throughput), Hyperbolic, Deepinfra, Novita, Parasail, Nebius]

    Together AI vs. other providers: 13% faster

    We achieved over 13% faster serverless inference performance for DeepSeek-R1-0528 when compared with the next fastest provider.

    Learn more

Deployment options

Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.

Serverless Inference

Real-time

A fully managed inference API that automatically scales with request volume.

Best for

Variable or unpredictable traffic

Rapid prototyping and iteration

Cost-sensitive or early-stage production workloads

Batch

Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% less cost.

Best for

Classifying large datasets

Offline summarization

Synthetic data generation
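To show how results from a Batch job come back once processing finishes, here is a companion to the submission sketch earlier on this page. The `get_batch` call, the status values, the `output_file_id` field, and `files.retrieve_content` are assumptions in the same OpenAI-compatible style, not confirmed Together SDK signatures.

```python
# Sketch: poll a submitted batch and download its output when done.
# Method/field names and status strings are assumptions; check current docs.
import json
import time

from together import Together

client = Together()
batch_id = "batch_abc123"  # hypothetical ID returned when the job was created

while True:
    job = client.batches.get_batch(batch_id)
    if job.status in ("COMPLETED", "FAILED", "EXPIRED"):
        break
    time.sleep(60)  # jobs finish under the 24h SLA, often within hours

if job.status == "COMPLETED":
    # Output is a JSONL file; each line carries the custom_id from the input.
    client.files.retrieve_content(id=job.output_file_id, output="batch_output.jsonl")
    with open("batch_output.jsonl") as f:
        for line in f:
            print(json.loads(line)["custom_id"])
```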

Dedicated Inference

Dedicated Model Inference

An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine.

Best for

Predictable or steady traffic

Latency-sensitive applications

High-throughput production workloads

Dedicated Container Inference

Run inference with your own engine and model on fully managed, scalable infrastructure.

Best for

Generative media models

Non-standard runtimes

Custom inference pipelines

Production-grade security and data privacy

We take security and compliance seriously, with strict data privacy controls to keep your information protected. Your data and models remain fully under your ownership, safeguarded by robust security measures.

Learn More


  • NVIDIA preferred partner
  • AICPA SOC 2 Type II

Customers running inference in production

  • 30B

    Enqueued tokens

  • 24h

    SLA

"We rely on the Batch Inference API to process very large amounts of requests. The high rate limits—up to 30B enqueued tokens—let us run massive experiments without bottlenecks, and jobs consistently finish well under the 24-hour SLA, often within just hours. It’s transformed the pace at which we can test and iterate."

Volodymyr Kuleshov

Co-Founder, Inception Labs