DeepSeek
See the reasoning. Slash the bill.
DeepSeek is the first open-weight model to outperform GPT-4 with transparent reasoning tokens, at one-tenth the price. Build with confidence.

Get Started in Minutes
Drop-in OpenAI replacement: no code changes, no surprises on your bill. Switch from closed models to DeepSeek instantly with OpenAI-compatible endpoints.
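The switch can be sketched with the OpenAI Python SDK: only the base URL and model name change from a stock OpenAI integration. The model id and environment variable name below are illustrative; check the Together dashboard for the exact values for your account.

```python
# Minimal sketch: pointing an existing OpenAI-SDK app at Together's
# DeepSeek endpoint. Only BASE_URL and MODEL change.
# Assumes the `openai` package and a TOGETHER_API_KEY env var.
import os

BASE_URL = "https://api.together.xyz/v1"   # instead of https://api.openai.com/v1
MODEL = "deepseek-ai/DeepSeek-R1"          # instead of an OpenAI model id

def ask(prompt: str) -> str:
    from openai import OpenAI  # imported lazily so the sketch loads without the SDK
    client = OpenAI(base_url=BASE_URL, api_key=os.environ["TOGETHER_API_KEY"])
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# ask("Why is the sky blue?")  # needs network access and a valid API key
```

The rest of your code, including streaming and tool-calling paths that use the same SDK, stays as-is.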
Why DeepSeek on Together AI?
Know exactly why your model answers the way it does.
The first reasoning models with fully transparent reasoning tokens, proven benchmark superiority, and complete model ownership for advanced enterprise deployment.
Unmatched Performance
Native chain-of-thought reasoning built into model architecture through large-scale reinforcement learning. DeepSeek R1 exposes its complete thinking process in <think> tags, enabling debugging and verification of model decisions.
DeepSeek R1 beats OpenAI o1 on verified benchmarks
Breakthrough Economics
Mixture-of-experts architecture activates only 37B of 671B parameters per token, delivering frontier performance at dramatically reduced computational cost and faster inference speeds.
90% cost reduction vs. closed models without quality compromise
Full Model Control
Download the weights or call the API—deploy on Together’s cloud or on-prem. No vendor lock-in.
Complete data & model ownership vs closed models
Meet the DeepSeek Pod
From frontier reasoning to efficient MoE design, choose the DeepSeek model that fits your needs.

Breakthrough Technical Innovations
DeepSeek models introduce game-changing architectural advances that redefine reasoning in open-source AI.
Mixture of Experts (MoE)
Sparse expert routing activates only 37B out of 671B parameters for each token in V3. Advanced load balancing without auxiliary losses maintains performance while reducing computational cost.
V3: 671B params, 37B active per token
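The routing idea can be shown in a toy sketch: score every expert per token, keep only the top-k, and renormalize their weights. The numbers and expert count here are illustrative, not DeepSeek's real router.

```python
# Toy sketch of sparse top-k expert routing (illustrative, not V3's real gate).
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(expert_logits, k=2):
    """Pick the top-k experts for one token and renormalize their weights."""
    probs = softmax(expert_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# 8 experts, but only 2 are activated for this token:
experts = route([0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2], k=2)
```

Because the non-selected experts never run, compute per token scales with the active parameter count (37B) rather than the total (671B).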
Group Relative Policy Optimization (GRPO)
New RL approach that removes the separate value network required by PPO-style RLHF, using grouped relative advantage estimation to cut compute requirements while maintaining training stability.
R1: First major reasoning model trained with GRPO methodology
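The core of grouped relative advantage estimation fits in a few lines: sample a group of completions per prompt, score them, and use each reward's standardized deviation within the group as its advantage, with no learned value network. This is a simplified sketch (e.g. it uses population standard deviation), not the full GRPO objective.

```python
# Sketch of GRPO-style grouped relative advantages: each completion's
# advantage is its reward's z-score within the sampled group.
import statistics

def group_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled completions for the same prompt, scored by a reward model:
adv = group_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions above the group mean get positive advantages, those below get negative ones, and the critic network that PPO would need is gone entirely.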
Native Reasoning Transparency
First reasoning model to expose complete thinking process in <think> tags. Native reasoning capabilities built into model foundation through large-scale reinforcement learning.
Pure RL training methodology enables step-by-step transparency
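Because the reasoning trace arrives inline, separating it from the final answer is a one-regex job. The sample completion below is invented for illustration; only the `<think>...</think>` wrapper reflects R1's actual output format.

```python
# Sketch: splitting a DeepSeek R1 completion into its reasoning trace
# and final answer. The sample text is invented.
import re

def split_reasoning(completion: str):
    m = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    reasoning = m.group(1).strip() if m else ""
    answer = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
    return reasoning, answer

sample = "<think>12 * 12 = 144, so the square root of 144 is 12.</think>The answer is 12."
reasoning, answer = split_reasoning(sample)
```

This makes it straightforward to log reasoning traces for debugging while showing users only the final answer.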
FP8 Mixed Precision Training
First successful implementation of FP8 mixed precision training on a 671B-parameter model, paired with a pioneering reinforcement learning approach that skips supervised fine-tuning as a preliminary step.
V3: 2.788M H800 GPU hours
Multi-Head Latent Attention
Innovative attention mechanism that reduces KV-cache memory requirements while maintaining modeling performance. Optimized for efficient inference deployment.
Optimized for inference efficiency
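A back-of-the-envelope comparison shows why compressing the KV-cache matters: standard multi-head attention caches full per-head keys and values for every token, while a latent-attention scheme caches one small compressed vector. All dimensions below are illustrative round numbers, not V3's actual configuration.

```python
# Rough KV-cache size comparison: full per-head K/V vs. one compressed
# latent per token. Numbers are illustrative, not DeepSeek's real config.

def kv_cache_bytes(n_layers, seq_len, per_token_floats, bytes_per_float=2):
    return n_layers * seq_len * per_token_floats * bytes_per_float

n_layers, seq_len = 60, 128_000
mha = kv_cache_bytes(n_layers, seq_len, per_token_floats=2 * 128 * 128)  # K and V: 128 heads x 128 dims
mla = kv_cache_bytes(n_layers, seq_len, per_token_floats=512)            # one compressed latent
ratio = mha / mla
```

With these toy numbers the latent cache is 64x smaller, which is what lets long-context inference fit on far less GPU memory.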
Multi-Token Prediction
Novel training objective that allows the model to predict multiple tokens simultaneously. Enhanced performance and efficiency through advanced training techniques.
V3 Enhanced performance & efficiency optimization
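The training objective can be sketched as target construction: at each position the model learns to predict the next k tokens, not just the next one. This toy function only builds the targets; the real objective also involves additional prediction heads and a combined loss.

```python
# Toy sketch of multi-token prediction targets: each position is trained
# against its next k tokens instead of a single next token.
def mtp_targets(tokens, k=2):
    """For each position, return the next k tokens as training targets."""
    return [tokens[i + 1 : i + 1 + k] for i in range(len(tokens) - k)]

targets = mtp_targets(["the", "cat", "sat", "on", "the", "mat"], k=2)
```

Each training position now supervises k predictions, densifying the learning signal per sequence.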
Deploy on Together AI
Access DeepSeek models through Together's optimized inference platform with enterprise-grade security and performance guarantees.
Serverless Endpoints
Pay-per-token pricing with automatic scaling. Perfect for getting started or variable workloads.
Best for:
Prototyping and development
Variable or unpredictable traffic
Cost optimization for low volume
Getting started quickly
DeepSeek R1-0528:
Starting at $0.55/1M tokens
DeepSeek V3:
Starting at $1.25/1M tokens
On-Demand Dedicated
Dedicated GPU capacity with guaranteed performance. No rate limits. Built for production.
Best for:
Production applications
Extended model library access
Predictable latency requirements
Enterprise SLA needs
DeepSeek R1-0528:
$0.67/minute (8x H200)
DeepSeek V3:
$0.67/minute (8x H200)
Monthly Reserved
Committed GPU capacity, enterprise features and volume discounts. Optimized for scale.
Best for:
High-volume committed usage
Enterprise security requirements
Priority hardware access
Maximum cost efficiency
Reserved GPU pricing:
Starting at $0.98/hr
Volume Discounts:
Up to 40% savings
Enterprise-Grade Security
Your data and models remain fully under your control with industry-leading security standards.
SOC 2 Type II
Comprehensive security controls audited by third parties.
HIPAA Compliant
Healthcare-grade data protection for sensitive workloads.
Model Ownership
You own your fine-tuned models and can deploy anywhere.
US-Based Infrastructure
Models hosted on secure North American servers with strict data sovereignty controls.
Real Performance Benchmarks
See how DeepSeek models stack up against the competition on verified benchmarks that matter.
Try DeepSeek Models - Free
Experience the performance difference in Together Chat.
Frequently Asked Questions
How does DeepSeek R1's reasoning compare to OpenAI o1?
DeepSeek R1 offers superior reasoning capabilities with native chain-of-thought built into the architecture. Verified benchmarks show R1 achieving 97.3% on MATH-500 vs o1's 96.4%, with full transparency of the reasoning process through <think> tags.
What are the current pricing rates for DeepSeek models?
- DeepSeek R1: $3 input / $7 output per million tokens
- DeepSeek R1 Throughput: $0.55 input / $2.19 output per million tokens
- DeepSeek V3: $1.25 per million tokens
All models offer 70–90% cost savings compared to similar closed models.
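The savings are easy to check with a small cost calculator using the R1 Throughput rates quoted above ($0.55 in / $2.19 out per million tokens). The closed-model rates used for comparison are a hypothetical reference point, not a real quote.

```python
# Quick cost sketch at the serverless rates above. The "closed" rates
# are a hypothetical comparison point, not any vendor's real pricing.
def cost_usd(input_toks, output_toks, in_rate, out_rate):
    return (input_toks * in_rate + output_toks * out_rate) / 1_000_000

# Example workload: 100M input + 20M output tokens per month.
deepseek = cost_usd(100_000_000, 20_000_000, in_rate=0.55, out_rate=2.19)
closed = cost_usd(100_000_000, 20_000_000, in_rate=15.00, out_rate=60.00)
savings = 1 - deepseek / closed
```

At these assumed rates, the workload costs $98.80 on DeepSeek versus $2,700 on the hypothetical closed model, a savings of over 90%.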
Can I fine-tune DeepSeek models on my own data?
Yes! DeepSeek models are open-weight with MIT licensing, meaning you can fine-tune them for your specific use cases, own the resulting model weights, and use them commercially. Deploy anywhere without restrictions.
What are the context length limits for each model?
- DeepSeek V3: 128K token context length
- DeepSeek R1: 128K token context length
How do I migrate from OpenAI to DeepSeek on Together AI?
Migration is seamless with Together AI’s OpenAI-compatible API. Simply change the base URL and model name in your existing code. Same API format, better reasoning, and transparent costs.
What makes the Mixture of Experts (MoE) architecture special?
DeepSeek models use MoE architecture where only 37B of 671B parameters activate per token. This delivers frontier performance at dramatically reduced computational cost and faster inference speeds.
Is there really a free DeepSeek model?
Yes! DeepSeek R1 Distilled Llama 70B Free is completely free with reduced rate limits. It beats GPT-4o on math problems and matches o1-mini on coding tasks.
What's the difference between R1 and the distilled models?
R1 is the full 671B parameter reasoning model. Distilled models are smaller (1.5B–70B) versions trained on reasoning examples from R1, offering similar capabilities at lower cost and faster speeds.