💰 Announcing our Series C. Intelligence should be abundant, not expensive →

📊 Delivering 31% more TPS than the next-fastest OSS engine for production coding agent workloads →

⚡ On-demand B200s now available on Together GPU Clusters →

🚀 Now serving MiniMax-M3 for efficient inference →

Models / Deep Cogito

Deep Cogito

Deploy Cogito v2 models on Together AI. Iterative self-improvement, 60% shorter reasoning chains, and frontier performance under open license.

Why Deep Cogito on Together AI?

Designed for production workloads that need  consistent performance and operational control.

Iterative self-improvement

First reasoning models to improve core intelligence, not just search time. Models develop stronger intuition through distillation of reasoning processes — delivering 60% shorter reasoning chains than DeepSeek R1 with superior performance.

Breakthrough efficiency

Complete model family trained for under $3.5M total cost. Significantly more efficient than capital-intensive approaches — proving superintelligence research is accessible to the broader ecosystem.

Open superintelligence

All models released under open license for commercial use. Complete transparency in the reasoning process with visible thinking tags — build on the research or deploy without restrictions.

Meet the Deep Cogito family

Explore top-performing models across text, image, video, code, and voice.

Deploy own model

Chat

Cogito v2 preview - 671B MoE

Chat

Cogito v2.1 671B

Chat

Cogito v2 preview - 405B

Chat

Cogito V1 Preview Llama 70B

Chat

Cogito V1 Preview Qwen 32B

Chat

Cogito V1 Preview Qwen 14B

Chat

Cogito V1 Preview Llama 3B

Chat

Cogito v2 preview - 109B MoE

Chat

Cogito v2 preview - 70B

Have your own model?

Deploy custom containers on Together’s managed GPU infrastructure with automatic scaling, job queues, and built-in observability.

Deployment options

Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.

Serverless Inference
Provisioned  Throughput
Dedicated Model  Inference
Dedicated Container  Inference

Serverless Inference

A fully managed real-time or batch inference API with access to dozens of the most popular AI models.

Best for

Variable or unpredictable traffic

Rapid prototyping and iteration

Cost-sensitive or early-stage production workloads

Provisioned  Throughput

Reserved token capacity with SLA guarantees. Priced in PTUs, a normalized throughput unit.

Best for

Production workloads

Reliability guarantees

Predictable pricing

Dedicated Model  Inference

An inference endpoint backed by reserved, isolated compute resources and Together AI inference research.

Best for

Predictable or steady traffic

Latency-sensitive applications

High-throughput production workloads

Dedicated Container  Inference

Run inference with your own engine and model on fully-managed, scalable infrastructure.

Best for

Generative media models

Non-standard runtimes

Custom inference pipelines