DeepSeek-V4 Pro now available on Together AI

What's New

DeepSeek V4 Pro on Together AI: DeepSeek V4 Pro is now available on Together AI with a 512K-token context window for long-context reasoning workloads.
Large-scale MoE architecture: DeepSeek V4 Pro uses a 1.6T-parameter Mixture-of-Experts architecture with 49B activated parameters.
Controllable reasoning modes: Non-Think, Think High, and Think Max let teams choose between fast responses, deeper reasoning, and maximum reasoning effort.
Transparent serverless pricing: DeepSeek V4 Pro is available at $2.10 per 1M input tokens, $0.20 per 1M cached input tokens, and $4.40 per 1M output tokens.

Long-context reasoning changes what teams can ask a model to do. Entire repositories, large document sets, long agent traces, and tool outputs can fit into the model’s working context instead of being compressed into brittle summaries. But the models that can use that much context are also the hardest to serve: a 1.6T-parameter MoE with million-token context is not something most teams want to deploy, tune, and operate themselves.

DeepSeek-V4 Pro is now available on Together AI, the AI Native Cloud, so teams can start with Serverless Inference at 512K context and move to dedicated infrastructure for full 1M context, reserved capacity, and production control. DeepSeek-V4 Flash is coming soon, giving teams another V4 option for workloads where speed and cost matter more than maximum reasoning depth.

At a glance

Spec	Value
Model	DeepSeek V4 Pro on Together AI
Endpoint	deepseek-ai/DeepSeek-V4-Pro
Architecture	1.6T-parameter MoE
Activated parameters	49B
Context on Together AI	512K tokens
Model-level context	1M tokens
Reasoning modes	Non-Think, Think High, Think Max
Deployment	Serverless, Monthly Reserved
Input price	$2.10 / 1M tokens
Cached input price	$0.20 / 1M tokens
Output price	$4.40 / 1M tokens
Best-fit workloads	Code agents, document intelligence, long-context agents, research synthesis

Built for long-context reasoning

DeepSeek V4 Pro is built for workloads where the model needs to reason over more than a short prompt: large repositories, long technical documents, dense retrieval bundles, tool-call histories, and research corpora.

DeepSeek V4 Pro supports million-token context at the model level; on Together AI, it is currently available with a 512K-token context window. That distinction matters because model capability and deployed serving profile are not always the same thing. Together AI is launching DeepSeek V4 Pro with a context window designed for reliable production serving, while still giving teams enough room for serious long-context workloads.
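In practice, this means sizing requests against the deployed window rather than the model-level limit. The sketch below is one way to do that pre-flight check. It assumes "512K" means 512,000 tokens and "1M" means 1,000,000 (the exact limits may differ), that output tokens count against the window, and that four characters per token is a usable rough estimate; none of these come from the announcement, and the function names are hypothetical.

```python
# Hypothetical pre-flight check. The limits and the chars-per-token ratio
# are assumptions for illustration, not published values.
SERVERLESS_CONTEXT_TOKENS = 512_000  # assumed reading of "512K"
MODEL_CONTEXT_TOKENS = 1_000_000     # assumed reading of "1M"
CHARS_PER_TOKEN = 4                  # rough heuristic, not the model's tokenizer


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def pick_deployment(prompt_tokens: int, max_output_tokens: int) -> str:
    """Return the smallest deployment tier whose window could hold the request."""
    total = prompt_tokens + max_output_tokens
    if total <= SERVERLESS_CONTEXT_TOKENS:
        return "serverless"
    if total <= MODEL_CONTEXT_TOKENS:
        return "dedicated"
    return "too large"
```

For real budgeting, a count from the model's actual tokenizer would replace `estimate_tokens`.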

The architecture also matters because long context is not only a product spec. As context grows, serving cost, memory pressure, KV cache usage, latency, and concurrency all become part of the system design. DeepSeek V4 Pro uses hybrid attention, combining Compressed Sparse Attention and Heavily Compressed Attention. DeepSeek reports that, at million-token context, the model needs 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek V3.2.
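To make those ratios concrete, the arithmetic below converts them into reduction factors. Only the 27% and 10% figures come from DeepSeek's report; the 100 GB baseline is a made-up number used purely for illustration.

```python
# Ratios reported by DeepSeek for V4 Pro relative to V3.2 at million-token context.
FLOPS_RATIO = 0.27
KV_CACHE_RATIO = 0.10


def reduction_factor(ratio: float) -> float:
    """Convert 'X% of the baseline' into an 'N times smaller' factor."""
    return 1.0 / ratio


def scaled(baseline: float, ratio: float) -> float:
    """Apply a reported ratio to a baseline quantity."""
    return baseline * ratio


# Hypothetical baseline: a 100 GB KV cache on the older model.
hypothetical_baseline_gb = 100.0

print(round(reduction_factor(FLOPS_RATIO), 1))     # 3.7
print(round(reduction_factor(KV_CACHE_RATIO), 1))  # 10.0
print(round(scaled(hypothetical_baseline_gb, KV_CACHE_RATIO), 1))  # 10.0
```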

Choose reasoning effort by workload

DeepSeek V4 Pro supports three reasoning modes, so teams can match reasoning depth to task difficulty instead of treating every request the same.

Mode	Use when	Tradeoff
Non-Think	Extraction, classification, simple Q&A, routine responses	Fastest path for lower-complexity tasks
Think High	Code planning, document analysis, multi-step reasoning	More reasoning depth for complex work
Think Max	Hard debugging, deep research synthesis, agentic decision points	Maximum reasoning effort; expect higher latency and token usage

A document assistant might use Non-Think for simple extraction, Think High for conflict analysis across policies, and Think Max only when the model needs to reason through a difficult decision. A code agent might use Think High for planning a migration and Think Max for debugging a subtle cross-service failure.
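One way to encode this kind of routing is a small lookup from task type to mode. The sketch below is illustrative only: the task labels and the `choose_mode` helper are hypothetical, and it deliberately stops short of an API call because this post does not specify which request parameter selects the mode.

```python
from enum import Enum


class ReasoningMode(Enum):
    NON_THINK = "Non-Think"
    THINK_HIGH = "Think High"
    THINK_MAX = "Think Max"


# Hypothetical task labels, grouped the same way as the table above.
MODE_BY_TASK = {
    "extraction": ReasoningMode.NON_THINK,
    "classification": ReasoningMode.NON_THINK,
    "simple_qa": ReasoningMode.NON_THINK,
    "code_planning": ReasoningMode.THINK_HIGH,
    "document_analysis": ReasoningMode.THINK_HIGH,
    "multi_step_reasoning": ReasoningMode.THINK_HIGH,
    "hard_debugging": ReasoningMode.THINK_MAX,
    "research_synthesis": ReasoningMode.THINK_MAX,
    "agent_decision": ReasoningMode.THINK_MAX,
}


def choose_mode(task: str) -> ReasoningMode:
    """Pick a reasoning mode for a task label, defaulting to Think High."""
    return MODE_BY_TASK.get(task, ReasoningMode.THINK_HIGH)


print(choose_mode("extraction").value)      # Non-Think
print(choose_mode("hard_debugging").value)  # Think Max
```

Defaulting unknown tasks to Think High is a judgment call: it avoids under-thinking on unfamiliar work without paying Think Max latency and token costs on every request.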

DeepSeek reports benchmark results across coding, reasoning, long-context, and agentic tasks, including 93.5% on LiveCodeBench, 90.1% on GPQA Diamond, 80.6% on SWE-bench Verified, 83.5% on MRCR 1M, and 62.0% on CorpusQA 1M.

Make repeated long-context queries cheaper with cached input pricing

Long-context systems often reuse the same large context across multiple questions: a repository snapshot, a document bundle, a policy archive, a retrieval payload, or a long agent trace. Cached input pricing makes those repeated workloads more practical.

DeepSeek V4 Pro is priced at $2.10 / 1M input tokens, with cached input at $0.20 / 1M tokens and output at $4.40 / 1M tokens. Cached input therefore costs roughly 90% less than uncached input, which matters when the expensive part of the request is a stable block of text that gets reused across follow-up analysis.

Example pattern:

Load a large stable context, such as a 300K-token repo summary, contract set, or policy archive.
Ask several follow-up questions over that same context.
Use cached input pricing where applicable to cut the input cost of the reused context by roughly 90%.
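The pattern above can be turned into a rough cost estimate. The prices are the ones listed in this post; the session shape (ten questions over a 300K-token context, 200-token questions, 1,000-token answers) is hypothetical, as is the simplifying assumption that the first request pays the full input price and every later request hits the cache for the entire shared context.

```python
# Prices from this post, in dollars per 1M tokens.
PRICE_PER_M = {"input": 2.10, "cached_input": 0.20, "output": 4.40}


def request_cost(fresh_in: int, cached_in: int, out: int) -> float:
    """Dollar cost of one request given its token counts."""
    return (
        fresh_in * PRICE_PER_M["input"]
        + cached_in * PRICE_PER_M["cached_input"]
        + out * PRICE_PER_M["output"]
    ) / 1_000_000


def session_cost(context_tokens: int, n_questions: int, question_tokens: int,
                 answer_tokens: int, use_cache: bool) -> float:
    """Total cost of asking n_questions over the same stable context.

    Assumes the first request is uncached and later ones are fully cached.
    """
    total = 0.0
    for i in range(n_questions):
        cache_hit = use_cache and i > 0
        cached = context_tokens if cache_hit else 0
        fresh = question_tokens + (0 if cache_hit else context_tokens)
        total += request_cost(fresh, cached, answer_tokens)
    return total


without_cache = session_cost(300_000, 10, 200, 1_000, use_cache=False)
with_cache = session_cost(300_000, 10, 200, 1_000, use_cache=True)
print(f"${without_cache:.2f} without caching")  # $6.35 without caching
print(f"${with_cache:.2f} with caching")        # $1.22 with caching
```

Real savings depend on how often requests actually hit the cache, so treat these figures as an upper bound on the benefit for this session shape.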

Workload patterns

Code agents

Use DeepSeek V4 Pro when an agent needs to reason across repository slices, issue traces, internal documentation, prior tool calls, and proposed patches. Think High or Think Max is most useful for planning changes, debugging failures, or resolving cross-file dependencies.

Document intelligence

Use long context for contracts, policy sets, technical manuals, or research collections that need to be compared in one request. Non-Think can handle extraction and simple Q&A; Think High is better for conflict analysis, interpretation, and synthesis.

Long-context agent traces

Use DeepSeek V4 Pro to inspect long tool-call histories, intermediate results, and execution traces. Higher reasoning modes are most useful at decision points: when the agent needs to decide whether to continue, call another tool, revise a plan, or stop.

Research synthesis

Use DeepSeek V4 Pro for workflows that combine papers, notes, benchmark reports, retrieved documents, and prior analysis. Cached input pricing is especially useful when the same evidence set is reused across multiple questions.
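A minimal sketch of that reuse pattern follows. It keeps the evidence set in an identical leading message and varies only the final question, on the assumption that caching keys on a shared prefix; this post does not describe how Together AI's cache matches requests, so check the platform documentation. The `build_messages` and `ask` helpers and the system instruction text are hypothetical; the API call uses the same `client.chat.completions.create` call as the sample in "Try it now", without streaming.

```python
MODEL = "deepseek-ai/DeepSeek-V4-Pro"


def build_messages(evidence: str, question: str) -> list[dict]:
    """Put the stable evidence first and the changing question last."""
    return [
        {
            "role": "system",
            "content": "Answer using only the evidence below.\n\n" + evidence,
        },
        {"role": "user", "content": question},
    ]


def ask(evidence: str, question: str) -> str:
    """Send one question over the shared evidence set and return the answer."""
    # Imported lazily so build_messages can be used without the SDK installed.
    from together import Together

    client = Together()  # reads TOGETHER_API_KEY from the environment
    response = client.chat.completions.create(
        model=MODEL,
        messages=build_messages(evidence, question),
    )
    return response.choices[0].message.content
```

Calling `ask(evidence, q)` for each question in a list keeps the first message byte-for-byte identical across requests, which is the property a prefix-based cache would need.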

Start serverless, move to reserved capacity

DeepSeek V4 Pro is available on Together AI Serverless Inference and Monthly Reserved infrastructure. Serverless is the right starting point for evaluation, development, and variable traffic. Monthly Reserved is better for steadier production demand where teams need more predictable capacity and cost control.

For long-context workloads, the deployment path matters. Teams are not only choosing a model; they are choosing how to manage throughput, concurrency, latency, KV cache pressure, and cost as context sizes grow. Together AI gives teams a path from evaluation to production without standing up the serving stack themselves.

Try it now

DeepSeek-V4 Pro is available today on Together AI Serverless Inference and Dedicated Endpoints.

    
from together import Together

client = Together()

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {
            "role": "user",
            "content": "Prove that the square root of 2 is irrational.",
        }
    ],
    stream=True,
)

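# Print tokens as they arrive; chunks without choices are skipped.
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    # Reasoning tokens, when present, arrive in a separate field from the answer.
    if hasattr(delta, "reasoning") and delta.reasoning:
        print(delta.reasoning, end="", flush=True)
    if hasattr(delta, "content") and delta.content:
        print(delta.content, end="", flush=True)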

Start with Serverless Inference for development and evaluation. For production workloads that require full 1M context, reserved capacity, workload isolation, or more predictable throughput, contact sales to deploy DeepSeek-V4 Pro on Together AI Dedicated Inference.

Get started

→ Follow our DeepSeek-V4 quickstart to get up and running in minutes

→ View the DeepSeek-V4 Pro Model Page

→ Try DeepSeek-V4 Pro in the Playground

→ Contact Sales for Dedicated Inference deployment and volume pricing
