NVIDIA Nemotron 3 Ultra
Open frontier reasoning model for long-running autonomous agents and complex workflows
About model
NVIDIA Nemotron 3 Ultra is a 550B parameter (55B activated) open reasoning model built for long-running autonomous agents handling orchestration and complex tasks across coding, deep research, and enterprise workflows. Its hybrid Mamba-Transformer MoE architecture combines Latent MoE — which calls 4 experts at the inference cost of 1 — with Multi-Token Prediction for reduced generation time on long sequences, and Token Budget support for optimal accuracy with minimum reasoning token output. The model supports a 1M token context window and is fully open under the NVIDIA Open Model License with open weights, training data, and recipes.
550B
Hybrid Mamba-Transformer MoE with Latent MoE
1M
Sustained reasoning across long-running agent sessions
Open
NVIDIA Open Model License for enterprise customization
- Coding Agents: Architectural planning, complex multi-file refactors, and error recovery across large codebases — handling the hardest reasoning calls within end-to-end coding agent workflows
- Deep Research: Sustained synthesis across large source sets, resolving contradictions and proposing novel hypotheses within research agent loops
- Enterprise & EDA Workflows: Complex reasoning steps within persistent, tool-using agent loops across security, regulatory, clinical, and chip design domains including RTL generation and design verification
- Efficient Architecture: Latent MoE runs 4 experts at the cost of 1, Multi-Token Prediction reduces generation time for long sequences, and Token Budget optimizes reasoning token usage — all within a 1M token context window
API usage
Endpoint:
Model card
Architecture Overview:
• 550B total parameter MoE with 55B parameters activated per token
• Hybrid Mamba-Transformer MoE architecture
• Latent MoE: runs 4 experts at the inference cost of 1, improving intelligence at no added compute
• Multi-Token Prediction (MTP): predicts multiple future tokens per forward pass, reducing generation time for long sequences
• Token Budget: optimizes for accuracy with minimum reasoning token generation
• 1M token context for sustained agent sessions and cross-document reasoning
• NVFP4 precision optimized for Blackwell; FP8 and BF16 also supported
Training Methodology:
• Multi-environment RL training across agentic environments for reasoning, tool calling, and instruction following
• Trained on NVIDIA-generated high-quality synthetic data from frontier open reasoning models
• Open training recipes published for domain-specific customization
Performance Characteristics:
• Leading accuracy on the Artificial Analysis Intelligence Index among open models
• Strong performance across reasoning, coding, and agentic task benchmarks
• Token Budget support enables predictable inference cost on long-horizon tasks
Prompting
Together AI API Access:
• Access NVIDIA Nemotron 3 Ultra via Together AI APIs using the endpoint nvidia/nemotron-3-ultra-550b-a55b
• Authenticate using your Together AI API key in request headers
• Supports tool calling, Token Budget for cost-controlled reasoning, and extended context up to 1M tokens
• Available on serverless and dedicated infrastructure
Applications & use cases
Coding Agents:
• Architectural planning and design decisions within week-long autonomous coding sessions
• Complex multi-file refactors and end-to-end issue resolution across large codebases
• Error recovery and iterative debugging within persistent agent loops
Deep Research & Search:
• Cross-referencing and synthesis across large source sets within sustained research agent loops
• Contradiction resolution and novel hypothesis generation at the final synthesis stage
• Long-context reasoning with 1M token window for extensive document sets
Enterprise Agent Workflows:
• Security alert triage, regulatory filing ingestion, and clinical trial orchestration within persistent tool-using loops
• Complex reasoning steps within multi-step enterprise automation across industries
EDA & Chip Design:
• RTL generation from specifications and verification across thousands of constraints
• Failure analysis and cross-block dependency resolution within chip design agent workflows
• Design-to-manufacturing sign-off orchestration
- Model providerNVIDIA
- TypeReasoning
- Main use casesReasoning
- FeaturesFunction CallingJSON Mode
- DeploymentServerlessOn-Demand DedicatedMonthly Reserved
- Parameters550B
- Activated parameters55B
- Context length1M
- Input price
$0.60 / 1M tokens
$0.20 (cached)/1M
- Output price
$3.60 / 1M tokens
- Input modalitiesText
- Output modalitiesText
- ReleasedMay 30, 2026
- CategoryCode