NVIDIA Nemotron 3 Ultra

Open frontier reasoning model for long-running autonomous agents and complex workflows

About model

NVIDIA Nemotron 3 Ultra is a 550B-parameter open reasoning model (55B parameters activated per token) built for long-running autonomous agents that handle orchestration and complex tasks across coding, deep research, and enterprise workflows. Its hybrid Mamba-Transformer MoE architecture combines three efficiency features: Latent MoE, which runs 4 experts at the inference cost of 1; Multi-Token Prediction, which reduces generation time on long sequences; and Token Budget support, which targets high accuracy with as few reasoning tokens as possible. The model supports a 1M-token context window and is fully open under the NVIDIA Open Model License, with open weights, training data, and recipes.

• Total parameters: 550B (55B activated). Hybrid Mamba-Transformer MoE with Latent MoE.
• Context window: 1M tokens. Sustained reasoning across long-running agent sessions.
• Weights + data + recipes: Open. NVIDIA Open Model License for enterprise customization.

Model key capabilities

• Coding Agents: Architectural planning, complex multi-file refactors, and error recovery across large codebases, handling the hardest reasoning calls within end-to-end coding agent workflows
• Deep Research: Sustained synthesis across large source sets, resolving contradictions and proposing novel hypotheses within research agent loops
• Enterprise & EDA Workflows: Complex reasoning steps within persistent, tool-using agent loops across security, regulatory, clinical, and chip design domains, including RTL generation and design verification
• Efficient Architecture: Latent MoE runs 4 experts at the cost of 1, Multi-Token Prediction reduces generation time for long sequences, and Token Budget optimizes reasoning token usage, all within a 1M token context window

API usage

Endpoint: nvidia/nemotron-3-ultra-550b-a55b

cURL:

curl -X POST "https://api.together.xyz/v1/chat/completions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-ultra-550b-a55b",
    "messages": [
      {
        "role": "user",
        "content": "What are some fun things to do in New York?"
      }
    ]
  }'

Python:

from together import Together

# The client reads the API key from the TOGETHER_API_KEY environment variable.
client = Together()

response = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b-a55b",
    messages=[
        {
            "role": "user",
            "content": "What are some fun things to do in New York?"
        }
    ]
)
# Print the text of the first (and only) returned choice.
print(response.choices[0].message.content)

TypeScript:

import Together from 'together-ai';

// The client reads the API key from the TOGETHER_API_KEY environment variable.
const together = new Together();

const completion = await together.chat.completions.create({
  model: 'nvidia/nemotron-3-ultra-550b-a55b',
  messages: [
    {
      role: 'user',
      content: 'What are some fun things to do in New York?'
    }
  ],
});

// Print the text of the first (and only) returned choice.
console.log(completion.choices[0].message.content);

Model card
Architecture Overview:
• 550B total parameter MoE with 55B parameters activated per token
• Hybrid Mamba-Transformer MoE architecture
• Latent MoE: runs 4 experts at the inference cost of 1, improving intelligence at no added compute
• Multi-Token Prediction (MTP): predicts multiple future tokens per forward pass, reducing generation time for long sequences
• Token Budget: optimizes for accuracy with minimum reasoning token generation
• 1M token context for sustained agent sessions and cross-document reasoning
• NVFP4 precision optimized for Blackwell; FP8 and BF16 also supported
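
For a rough sense of scale, the parameter counts and precisions listed above can be turned into an approximate weight footprint. This is a back-of-the-envelope sketch, not a published figure: it assumes a nominal 4, 8, and 16 bits per weight for NVFP4, FP8, and BF16, and it ignores block scale factors, cache or state memory, and activations. The function and constant names are illustrative.

```python
# Rough weight-memory estimate. Bit widths are nominal assumptions,
# not official deployment figures.
TOTAL_PARAMS = 550e9    # total parameters (from the model card)
ACTIVE_PARAMS = 55e9    # parameters activated per token (from the model card)

# Assumed nominal bits per weight; real NVFP4 checkpoints also store scale factors.
BITS_PER_WEIGHT = {"NVFP4": 4, "FP8": 8, "BF16": 16}


def weight_gb(num_params: float, bits: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


def active_fraction() -> float:
    """Return the fraction of parameters used per token."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: ~{weight_gb(TOTAL_PARAMS, bits):,.0f} GB of weights")
    print(f"Active per token: {active_fraction():.0%}")
```

Under these assumptions the weights alone come to roughly 275 GB in NVFP4, 550 GB in FP8, and 1,100 GB in BF16, with about 10% of parameters active for any given token.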

Training Methodology:
• Multi-environment RL training across agentic environments for reasoning, tool calling, and instruction following
• Trained on NVIDIA-generated high-quality synthetic data from frontier open reasoning models
• Open training recipes published for domain-specific customization

Performance Characteristics:
• Leading accuracy on the Artificial Analysis Intelligence Index among open models
• Strong performance across reasoning, coding, and agentic task benchmarks
• Token Budget support enables predictable inference cost on long-horizon tasks

Prompting
Together AI API Access:
• Access NVIDIA Nemotron 3 Ultra via Together AI APIs using the endpoint nvidia/nemotron-3-ultra-550b-a55b
• Authenticate using your Together AI API key in request headers
• Supports tool calling, Token Budget for cost-controlled reasoning, and extended context up to 1M tokens
• Available on serverless and dedicated infrastructure
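
The tool-calling support mentioned above can be sketched as a request body. This is a minimal illustration, assuming the OpenAI-style `tools` schema used by chat-completions APIs; the `get_weather` tool is hypothetical. The page does not name the request parameter that controls Token Budget, so the sketch only sets `max_tokens`, which is a hard cap on output length and not the same feature.

```python
import json

MODEL = "nvidia/nemotron-3-ultra-550b-a55b"  # endpoint name from this page


def build_tool_request(question: str) -> dict:
    """Build a chat-completions request body that offers one tool to the model."""
    get_weather = {  # hypothetical tool, for illustration only
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "tools": [get_weather],
        "max_tokens": 1024,  # hard output cap; NOT the Token Budget feature
    }


if __name__ == "__main__":
    body = build_tool_request("Do I need an umbrella in New York?")
    print(json.dumps(body, indent=2))
```

The resulting dictionary could be sent as the JSON body of a POST to the chat completions URL, or, assuming the Together Python client, passed as keyword arguments to `client.chat.completions.create(**body)`.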

Applications & use cases
Coding Agents:
• Architectural planning and design decisions within week-long autonomous coding sessions
• Complex multi-file refactors and end-to-end issue resolution across large codebases
• Error recovery and iterative debugging within persistent agent loops

Deep Research & Search:
• Cross-referencing and synthesis across large source sets within sustained research agent loops
• Contradiction resolution and novel hypothesis generation at the final synthesis stage
• Long-context reasoning with 1M token window for extensive document sets
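
Before sending a large document set, it can help to check that it plausibly fits in the 1M token window. The sketch below is a rough pre-check, not a tokenizer: it assumes about 4 characters per token, which is only a loose average for English text, and the `reserve_for_output` default is an arbitrary illustrative value. All names are hypothetical.

```python
CONTEXT_WINDOW = 1_000_000   # tokens, from the model card
CHARS_PER_TOKEN = 4          # assumed rough average for English text; not a tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def select_documents(docs: list[str], reserve_for_output: int = 50_000) -> list[str]:
    """Keep documents in order until the next one would exceed the budget."""
    budget = CONTEXT_WINDOW - reserve_for_output
    kept = []
    for doc in docs:
        cost = estimate_tokens(doc)
        if cost > budget:
            break  # stop at the first document that does not fit
        kept.append(doc)
        budget -= cost
    return kept
```

For real workloads, counting tokens with the model's actual tokenizer would be more reliable than a character heuristic.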

Enterprise Agent Workflows:
• Security alert triage, regulatory filing ingestion, and clinical trial orchestration within persistent tool-using loops
• Complex reasoning steps within multi-step enterprise automation across industries

EDA & Chip Design:
• RTL generation from specifications and verification across thousands of constraints
• Failure analysis and cross-block dependency resolution within chip design agent workflows
• Design-to-manufacturing sign-off orchestration

Model specifications

Model provider: NVIDIA
Type: Reasoning, Chat, Code
Main use cases: Reasoning
Features: Function Calling, JSON Mode
Deployment: Serverless, Monthly Reserved
Endpoint: nvidia/nemotron-3-ultra-550b-a55b
Parameters: 550B
Activated parameters: 55B
Context length: 1M tokens
Input price: $0.60 / 1M tokens ($0.20 / 1M cached tokens)
Output price: $3.60 / 1M tokens
Input modalities: Text
Output modalities: Text
Released: May 30, 2026
Category: Chat
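
The input and output prices listed above can be combined into a simple per-request cost estimate. This is a sketch under one stated assumption: that cached tokens are a subset of the input tokens and are billed at the cached rate instead of the standard input rate. The function name and signature are illustrative.

```python
# Prices from the specifications above, in USD per 1M tokens.
INPUT_PRICE = 0.60
CACHED_INPUT_PRICE = 0.20
OUTPUT_PRICE = 3.60


def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the cost of one request in USD.

    Assumes cached_tokens is the portion of input_tokens billed at the cached rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    total = (
        uncached * INPUT_PRICE
        + cached_tokens * CACHED_INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
    )
    return total / 1_000_000


if __name__ == "__main__":
    cost = estimate_cost(input_tokens=200_000, output_tokens=8_000, cached_tokens=150_000)
    print(f"Estimated cost: ${cost:.4f}")
```

In this example, a 200,000-token prompt with 150,000 cached tokens and 8,000 output tokens comes to about $0.09. Note that reasoning tokens are typically counted as output, so long reasoning traces can dominate the bill.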
