Interested in running DeepSeek-R1 in production?

Request access to Together Dedicated Endpoints—private and fast DeepSeek-R1 inference at scale.

  • Fastest inference: Our DeepSeek-R1 API runs 10x faster than DeepSeek's API
  • Flexible scaling: Deploy via Together Serverless or dedicated endpoints
  • High throughput: Up to 334 tokens/sec on dedicated infrastructure
  • Secure & reliable: Private, compliant, and built for production

DeepSeek-R1 on Together AI

Unmatched performance. Cost-effective scaling. Secure infrastructure.

Fastest inference engine

We run DeepSeek-R1 10x faster than DeepSeek's API, ensuring low-latency performance for production workloads.

Scalable infrastructure

Whether you're just starting out or scaling to production workloads, choose Together Serverless APIs for flexible, pay-per-token usage or dedicated endpoints for predictable, high-volume operations.

Security-first approach

We host all models in our own data centers, with no data sharing back to DeepSeek. Developers retain full control over their data with opt-out privacy settings.

Deployment options

Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.

Serverless Inference

Real-time

A fully managed inference API that automatically scales with request volume. A minimal call sketch follows the list below.

Best for:

  • Variable or unpredictable traffic
  • Rapid prototyping and iteration
  • Cost-sensitive or early-stage production workloads
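Getting started takes only an API key. Below is a minimal sketch using the Together Python SDK; the model ID shown is the one Together lists for DeepSeek-R1, so check the models page if it has changed.

```python
# Minimal serverless chat completion with the Together Python SDK
# (pip install together). Assumes TOGETHER_API_KEY is set in the environment.
from together import Together

client = Together()

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
)
print(response.choices[0].message.content)
```

The same endpoint is pay-per-token, so there is nothing to provision or tear down between experiments.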

Batch

Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% lower cost. A sketch of the request format follows the list below.

Best for:

  • Classifying large datasets
  • Offline summarization
  • Synthetic data generation
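Batch jobs are submitted as a JSONL file, one request per line. The sketch below builds such a file; the field names (custom_id, body) follow the common OpenAI-style batch schema, which is an assumption here, so verify them against Together's Batch API docs before submitting.

```python
import json

# Hypothetical example inputs: (id, text) pairs for an offline summarization job.
documents = [
    ("doc-001", "Text of the first document..."),
    ("doc-002", "Text of the second document..."),
]

# Write one request per line; custom_id lets you match outputs back to inputs.
with open("batch_requests.jsonl", "w") as f:
    for doc_id, text in documents:
        request = {
            "custom_id": doc_id,
            "body": {
                "model": "deepseek-ai/DeepSeek-R1",
                "messages": [{"role": "user", "content": f"Summarize:\n\n{text}"}],
                "max_tokens": 512,
            },
        }
        f.write(json.dumps(request) + "\n")
```

The file is then uploaded and submitted through the Batch API, and results are fetched asynchronously once the job completes.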

Dedicated Inference

Dedicated Model Inference

An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine. A call sketch follows the list below.

Best for:

  • Predictable or steady traffic
  • Latency-sensitive applications
  • High-throughput production workloads
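Calling a dedicated endpoint looks just like the serverless call; only the model field changes, taking the endpoint ID from your dashboard. The ID below is a hypothetical placeholder.

```python
# Same SDK and call shape as the serverless sketch; only the model ID differs.
# "myorg/deepseek-r1-prod" is a hypothetical placeholder for the endpoint ID
# shown in the Together dashboard after deployment.
from together import Together

client = Together()

response = client.chat.completions.create(
    model="myorg/deepseek-r1-prod",
    messages=[{"role": "user", "content": "Hello from a dedicated endpoint."}],
)
print(response.choices[0].message.content)
```

Because the compute is reserved, throughput and latency stay predictable under sustained load.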

Dedicated Container Inference

Run inference with your own engine and model on fully managed, scalable infrastructure.

Best for:

  • Generative media models
  • Non-standard runtimes
  • Custom inference pipelines