Interested in running GPT-OSS in production?

Request access to Together Reasoning Clusters: dedicated, private, and fast GPT-OSS inference at scale.

  • Faster inference through research-driven optimizations
  • Zero throttling during viral traffic spikes
  • 99.9% uptime SLA with multi-region deployment
  • Superior economics vs proprietary alternatives
  • Transparent pricing with no hidden fees

GPT-OSS on Together AI

Unmatched performance. Cost-effective scaling. Secure infrastructure.

Fastest inference engine

Our research team's innovations, including FlashAttention and custom kernels, deliver up to 50% cost savings and 2x performance improvements.

Scalable infrastructure

Automatic scaling from serverless to dedicated clusters handles everything from prototyping to full production during peak traffic, without throttling.

Reliable & secure

A 99.9% availability SLA, multi-region deployment, and enterprise-grade security on SOC 2-compliant servers in North America ensure your agentic workflows complete successfully.

Deployment options

Choose a deployment option based on latency needs, traffic patterns, and how much control you want over the infrastructure.


Serverless Inference

Real-time

A fully managed inference API that automatically scales with request volume (example below).

Best for:

  • Variable or unpredictable traffic
  • Rapid prototyping and iteration
  • Cost-sensitive or early-stage production workloads
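
Here is a minimal sketch of a serverless call using the Together Python SDK; the model slug openai/gpt-oss-120b is an assumption, so check the model catalog for the exact name.

```python
# Minimal serverless call via the Together Python SDK (pip install together).
# Assumes TOGETHER_API_KEY is set in the environment; the model slug is an
# assumption and should be checked against the current model catalog.
from together import Together

client = Together()

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # assumed GPT-OSS slug
    messages=[{"role": "user", "content": "Explain batch inference in one sentence."}],
)
print(response.choices[0].message.content)
```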

Batch

Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% less cost than real-time requests (example below).

Best for:

  • Classifying large datasets
  • Offline summarization
  • Synthetic data generation
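
Batch jobs are submitted as a JSONL file of independent requests. The sketch below only prepares such a file with the standard library; the exact request shape and the upload/submit calls are assumptions to verify against the Batch API docs.

```python
# Prepare a JSONL batch file: one OpenAI-compatible request per line.
# The request shape (custom_id + body) is an assumption; match it to the
# Batch API documentation before submitting.
import json

documents = ["First document text...", "Second document text..."]

with open("batch_requests.jsonl", "w") as f:
    for i, doc in enumerate(documents):
        request = {
            "custom_id": f"doc-{i}",  # lets you join results back to inputs
            "body": {
                "model": "openai/gpt-oss-120b",  # assumed GPT-OSS slug
                "messages": [{"role": "user", "content": f"Summarize:\n\n{doc}"}],
            },
        }
        f.write(json.dumps(request) + "\n")

# The file is then uploaded and submitted through the Batch API; results
# come back asynchronously as a JSONL file of responses keyed by custom_id.
```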

Dedicated Inference

Dedicated Model Inference

An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine (example below).

Best for:

  • Predictable or steady traffic
  • Latency-sensitive applications
  • High-throughput production workloads
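
Dedicated endpoints speak the same OpenAI-compatible API, so any compatible client works once pointed at the endpoint URL. Below is a sketch of a streaming call, a common pattern for latency-sensitive applications; the base URL and model slug are placeholders for the values shown in your dashboard.

```python
# Streaming call against a dedicated endpoint using the OpenAI Python SDK
# (pip install openai). The base_url and model slug below are placeholders;
# use the values shown for your endpoint in the dashboard.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # replace with your endpoint URL
    api_key=os.environ["TOGETHER_API_KEY"],
)

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # assumed slug
    messages=[{"role": "user", "content": "Stream a haiku about uptime."}],
    stream=True,  # tokens arrive as they are generated
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```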

Dedicated Container Inference

Run inference with your own engine and model on fully managed, scalable infrastructure (example below).

Best for:

  • Generative media models
  • Non-standard runtimes
  • Custom inference pipelines
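
For container inference, the platform runs your image and routes traffic to it, so the contract is essentially "expose an HTTP endpoint." The sketch below is a hypothetical minimal server of the kind you might package; the route name, request shape, and echo "engine" are all illustrative stand-ins for your own pipeline.

```python
# Hypothetical minimal inference server to package in a custom container.
# Route name, request shape, and the echo "engine" are illustrative only;
# swap in your own runtime (vLLM, TensorRT-LLM, a diffusion pipeline, ...).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    # Replace this echo with a call into your inference engine.
    return {"output": f"echo: {req.prompt}"}

# Local test: uvicorn server:app --host 0.0.0.0 --port 8000
```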