Batch Inference

Process massive workloads asynchronously

Scale to 30 billion tokens per model with any serverless model or private deployment.

Why Batch Inference with Together AI?

Lower cost, higher limits, and predictable processing

Up to 50% cost savings

Run batch jobs at up to half the cost of our real-time API for most serverless models. Process millions of requests economically without sacrificing quality or speed.

30B enqueued tokens per model

Run massive batch jobs that scale to 30 billion enqueued tokens per model per user. Need more? We'll customize limits for your specific use case.

<24h processing time SLA

Jobs consistently finish well under 24 hours — often within just hours. Submit and forget while we handle the scale.

Key capabilities, purpose-built for AI natives

Run massive asynchronous inference jobs against any serverless model or dedicated deployment.

    • Universal model access

      Any serverless model
      Dedicated deployments

      Run batch jobs across any serverless model or private deployment — no limitations on model choice.

    • Up and running in minutes

      Launch in three steps
      No DevOps required
      No orchestration

      Launch massive inference jobs simply by uploading a JSONL file. Start processing batches with just a few clicks. No orchestration or monitoring setup required (see the sketch after this list).

    • 50% off many top models

      DeepSeek, Llama, Qwen & more
      No minimum volume

      Run batch jobs at up to half the cost of our real-time API for most serverless models. Process millions of requests without sacrificing quality or speed.
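As a concrete illustration of the JSONL workflow described above, here is a minimal sketch of submitting a batch job with the Together Python SDK. The per-line request shape (`custom_id` plus a `body` holding `model` and `messages`), the `purpose="batch-api"` value, and the `files.upload` / `batches.create_batch` method names are assumptions patterned on OpenAI-compatible batch APIs; verify them against the current Together documentation. The model ID is illustrative.

```python
# Sketch: build a batch_input.jsonl file and enqueue a batch job.
# Method and field names are assumptions patterned on OpenAI-compatible
# batch APIs; verify against the current Together docs before use.
import json

from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# One request per line; custom_id lets you match outputs back to inputs.
requests = [
    {
        "custom_id": f"doc-{i}",
        "body": {
            "model": "deepseek-ai/DeepSeek-V3.1",  # illustrative model ID
            "messages": [{"role": "user", "content": f"Summarize document {i}."}],
            "max_tokens": 256,
        },
    }
    for i in range(3)
]
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload the file, then enqueue the batch against the chat completions endpoint.
uploaded = client.files.upload(file="batch_input.jsonl", purpose="batch-api")
batch = client.batches.create_batch(uploaded.id, endpoint="/v1/chat/completions")
print("Submitted batch:", batch.id)
```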

Research-optimized, best-in-class performance

We achieved up to 2x faster serverless inference for the most demanding LLMs, including GPT-OSS, Qwen, Kimi, and DeepSeek.

  • Output speed: gpt-oss-20B (low)

    [Chart comparing output speed across providers: Together AI, Vexteer, Lightning, Databricks, Nebius Base, Novita, Amazon, Cloudflare, Hyperbolic]

    Together AI vs. other providers: 2x faster

    We achieved nearly 2x faster serverless inference performance for gpt-oss-20B when compared with the next fastest provider.

    Learn more
  • Output speed: Qwen3 235B 2507

    [Chart comparing output speed across providers: Together AI (FP8), Amazon, Lightning, Databricks, Nebius Base, Novita, Cloudflare, Hyperbolic]

    Together AI vs. other providers: 2.75x faster

    We achieved nearly 2.75x faster serverless inference performance for Qwen3 235B 2507 when compared with the next fastest provider.

    Learn more
  • Output speed: Kimi K2 0905

    [Chart comparing output speed across providers: Together AI, Fireworks, Baseten (FP4), Parasail, Deepinfra, Novita]

    Together AI vs. other providers: 65% faster

    We achieved over 65% faster serverless inference performance for Kimi-K2-0905 when compared with the next fastest provider.

    Learn more
  • Output speed: DeepSeek V3.1

    [Chart comparing output speed across providers: Together AI, Fireworks, Baseten (FP4), Vertex, Parasail (FP8), Lightning AI, Amazon, GMI (FP8), Novita, Deepinfra (FP4)]

    Together AI vs. other providers: 10% faster

    We achieved over 10% faster serverless inference performance for DeepSeek-V3.1 when compared with the next fastest provider.

    Learn more
  • Output speed: DeepSeek R1 0528

    [Chart comparing output speed across providers: Together AI, Nebius Fast (FP4), Fireworks Fast, Vertex, Azure, Together AI (Throughput), Hyperbolic, Deepinfra, Novita, Parasail, Nebius]

    Together AI vs. other providers: 13% faster

    We achieved over 13% faster serverless inference performance for DeepSeek-R1-0528 when compared with the next fastest provider.

    Learn more

Deployment options

Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.

Serverless Inference

Real-time

A fully managed inference API that automatically scales with request volume.

Best for

Variable or unpredictable traffic

Rapid prototyping and iteration

Cost-sensitive or early-stage production workloads

Batch

Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% less cost.

Best for

Classifying large datasets

Offline summarization

Synthetic data generation
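To show how results from a Batch job come back once processing finishes, here is a companion to the submission sketch earlier on this page. The `get_batch` call, the status values, the `output_file_id` field, and `files.retrieve_content` are assumptions in the same OpenAI-compatible style, not confirmed Together SDK signatures.

```python
# Sketch: poll a submitted batch and download its output when done.
# Method/field names and status strings are assumptions; check current docs.
import json
import time

from together import Together

client = Together()
batch_id = "batch_abc123"  # hypothetical ID returned when the job was created

while True:
    job = client.batches.get_batch(batch_id)
    if job.status in ("COMPLETED", "FAILED", "EXPIRED"):
        break
    time.sleep(60)  # jobs finish under the 24h SLA, often within hours

if job.status == "COMPLETED":
    # Output is a JSONL file; each line carries the custom_id from the input.
    client.files.retrieve_content(id=job.output_file_id, output="batch_output.jsonl")
    with open("batch_output.jsonl") as f:
        for line in f:
            print(json.loads(line)["custom_id"])
```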

Dedicated Inference

Dedicated Model Inference

An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine.

Best for

Predictable or steady traffic

Latency-sensitive applications

High-throughput production workloads

Dedicated Container Inference

Run inference with your own engine and model on fully managed, scalable infrastructure.

Best for

Generative media models

Non-standard runtimes

Custom inference pipelines

Production-grade security and data privacy

We take security and compliance seriously, with strict data privacy controls to keep your information protected. Your data and models remain fully under your ownership, safeguarded by robust security measures.

Learn More


  • NVIDIA preferred partner
  • AICPA SOC 2 Type II

Customers running inference in production

  • 30B

    Enqueued tokens

  • 24h

    SLA

"We rely on the Batch Inference API to process very large amounts of requests. The high rate limits—up to 30B enqueued tokens—let us run massive experiments without bottlenecks, and jobs consistently finish well under the 24-hour SLA, often within just hours. It’s transformed the pace at which we can test and iterate."

Volodymyr Kuleshov

Co-Founder, Inception Labs