NVIDIA Nemotron 3 Ultra

Open frontier reasoning model for long-running autonomous agents and complex workflows

About model

NVIDIA Nemotron 3 Ultra is a 550B-parameter (55B activated) open reasoning model built for long-running autonomous agents that handle orchestration and complex tasks across coding, deep research, and enterprise workflows. Its hybrid Mamba-Transformer MoE architecture combines three features: Latent MoE, which runs 4 experts at the inference cost of 1; Multi-Token Prediction, which reduces generation time on long sequences; and Token Budget support, which targets the best accuracy with the fewest reasoning tokens. The model supports a 1M-token context window and is fully open under the NVIDIA Open Model License, with open weights, training data, and recipes.

  • Total Parameters: 550B (55B activated), hybrid Mamba-Transformer MoE with Latent MoE
  • Context Window: 1M tokens, for sustained reasoning across long-running agent sessions
  • Openness: open weights, data, and recipes under the NVIDIA Open Model License for enterprise customization

Model key capabilities
  • Coding Agents: Architectural planning, complex multi-file refactors, and error recovery across large codebases — handling the hardest reasoning calls within end-to-end coding agent workflows
  • Deep Research: Sustained synthesis across large source sets, resolving contradictions and proposing novel hypotheses within research agent loops
  • Enterprise & EDA Workflows: Complex reasoning steps within persistent, tool-using agent loops across security, regulatory, clinical, and chip design domains including RTL generation and design verification
  • Efficient Architecture: Latent MoE runs 4 experts at the cost of 1, Multi-Token Prediction reduces generation time for long sequences, and Token Budget optimizes reasoning token usage — all within a 1M token context window
  • API usage

    Endpoint: nvidia/nemotron-3-ultra-550b-a55b

    cURL:

    curl -X POST https://api.together.xyz/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $TOGETHER_API_KEY" \
      -d '{
        "model": "nvidia/nemotron-3-ultra-550b-a55b",
        "messages": [{
          "role": "user",
          "content": "Given two binary strings `a` and `b`, return their sum as a binary string"
        }]
      }'
    
    Python:

    from together import Together

    # The client reads TOGETHER_API_KEY from the environment.
    client = Together()
    response = client.chat.completions.create(
        model="nvidia/nemotron-3-ultra-550b-a55b",
        messages=[
            {
                "role": "user",
                "content": "Given two binary strings `a` and `b`, return their sum as a binary string",
            }
        ],
    )
    
    print(response.choices[0].message.content)
    
    
    TypeScript:

    import Together from "together-ai";
    
    const together = new Together();
    
    async function main() {
      const response = await together.chat.completions.create({
        model: "nvidia/nemotron-3-ultra-550b-a55b",
        messages: [{
          role: "user",
          content: "Given two binary strings `a` and `b`, return their sum as a binary string"
        }]
      });
      
      console.log(response.choices[0]?.message?.content);
    }
    
    main();
    
    
  • Model card

    Architecture Overview:
    • 550B total parameter MoE with 55B parameters activated per token
    • Hybrid Mamba-Transformer MoE architecture
    • Latent MoE: runs 4 experts at the inference cost of 1, improving accuracy without added inference compute
    • Multi-Token Prediction (MTP): predicts multiple future tokens per forward pass, reducing generation time for long sequences
    • Token Budget: targets the best accuracy with the fewest reasoning tokens generated
    • 1M token context for sustained agent sessions and cross-document reasoning
    • NVFP4 precision optimized for Blackwell; FP8 and BF16 also supported
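
To put the parameter counts and precision options above in perspective, here is a back-of-the-envelope sketch of weights-only storage at each listed precision. It assumes 2 bytes per parameter for BF16, 1 byte for FP8, and 0.5 bytes for NVFP4. It ignores NVFP4 scaling-factor overhead, caches, and activations, so a real deployment will need more memory than these figures. The function and constant names are illustrative and not part of any NVIDIA or Together AI tooling.

```python
# Rough weights-only memory estimate; illustrative, not an official sizing guide.
TOTAL_PARAMS = 550e9   # total parameters, from the model card
ACTIVE_PARAMS = 55e9   # parameters activated per token, from the model card

# Assumed storage cost per parameter; NVFP4 block-scale overhead is ignored.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_gigabytes(num_params: float, precision: str) -> float:
    """Return approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def active_fraction() -> float:
    """Return the fraction of parameters used for each token."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        size = weight_gigabytes(TOTAL_PARAMS, precision)
        print(f"{precision}: ~{size:.0f} GB of weights")
    print(f"Active per token: {active_fraction():.0%}")
```

Under these assumptions the weights alone come to roughly 1,100 GB in BF16, 550 GB in FP8, and 275 GB in NVFP4, with about 10% of the parameters active for any given token.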

    Training Methodology:
    • Multi-environment RL training across agentic environments for reasoning, tool calling, and instruction following
    • Trained on high-quality synthetic data that NVIDIA generated with frontier open reasoning models
    • Open training recipes published for domain-specific customization

    Performance Characteristics:
    • Leading accuracy on the Artificial Analysis Intelligence Index among open models
    • Strong performance across reasoning, coding, and agentic task benchmarks
    • Token Budget support enables predictable inference cost on long-horizon tasks

  • Prompting

    Together AI API Access:
    • Access NVIDIA Nemotron 3 Ultra via Together AI APIs using the endpoint nvidia/nemotron-3-ultra-550b-a55b
    • Authenticate using your Together AI API key in request headers
    • Supports tool calling, Token Budget for cost-controlled reasoning, and extended context up to 1M tokens
    • Available on serverless and dedicated infrastructure
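
As a minimal sketch of the tool-calling support mentioned above, the following builds a chat completion request that offers the model one tool and decodes any tool calls it returns. It assumes the OpenAI-compatible `tools` request field and `tool_calls` response shape. The `search_docs` tool and all helper names are hypothetical and exist only for this example. No Token Budget parameter is shown because this page does not name one.

```python
import json

MODEL = "nvidia/nemotron-3-ultra-550b-a55b"

# Hypothetical tool, defined only for this example.
SEARCH_DOCS_TOOL = {
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search internal documentation and return matching passages.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search terms"},
            },
            "required": ["query"],
        },
    },
}


def build_tool_request(prompt: str) -> dict:
    """Build keyword arguments for a chat completion that offers one tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [SEARCH_DOCS_TOOL],
    }


def parse_tool_calls(message: dict) -> list[tuple[str, dict]]:
    """Decode (name, arguments) pairs from an OpenAI-style assistant message."""
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string in this response shape.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


def send(request: dict):
    """Send the request; needs the `together` package and TOGETHER_API_KEY."""
    from together import Together  # imported lazily so the helpers work offline

    return Together().chat.completions.create(**request)
```

A caller would run `send(build_tool_request("Find our retry policy"))`, convert the returned assistant message to a plain dict, pass it to `parse_tool_calls`, execute each requested tool, and append the results as `tool` messages before calling the model again.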

  • Applications & use cases

    Coding Agents:
    • Architectural planning and design decisions within week-long autonomous coding sessions
    • Complex multi-file refactors and end-to-end issue resolution across large codebases
    • Error recovery and iterative debugging within persistent agent loops

    Deep Research & Search:
    • Cross-referencing and synthesis across large source sets within sustained research agent loops
    • Contradiction resolution and novel hypothesis generation at the final synthesis stage
    • Long-context reasoning with 1M token window for extensive document sets

    Enterprise Agent Workflows:
    • Security alert triage, regulatory filing ingestion, and clinical trial orchestration within persistent tool-using loops
    • Complex reasoning steps within multi-step enterprise automation across industries

    EDA & Chip Design:
    • RTL generation from specifications and verification across thousands of constraints
    • Failure analysis and cross-block dependency resolution within chip design agent workflows
    • Design-to-manufacturing sign-off orchestration

  • Model provider
    NVIDIA
  • Type
    Reasoning
  • Main use cases
    Reasoning
  • Features
    Function Calling
    JSON Mode
  • Deployment
    Serverless
    On-Demand Dedicated
    Monthly Reserved
  • Parameters
    550B
  • Activated parameters
    55B
  • Context length
    1M
  • Input price
    $0.60 / 1M tokens
    $0.20 / 1M tokens (cached)
  • Output price
    $3.60 / 1M tokens
  • Input modalities
    Text
  • Output modalities
    Text
  • Released
    May 30, 2026
  • Category
    Code
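
The listed prices can be turned into a per-request estimate with a little arithmetic. The sketch below assumes that cached input tokens are a subset of the total input tokens and are billed at the cached rate in place of the regular input rate. That billing rule is an assumption, not something this page states, and `estimate_cost` is an illustrative helper rather than part of any SDK.

```python
# Prices in USD per 1M tokens, copied from the listing above.
INPUT_PRICE = 0.60
CACHED_INPUT_PRICE = 0.20
OUTPUT_PRICE = 3.60


def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0) -> float:
    """Estimate the cost of one request in USD.

    Assumes cached tokens are part of input_tokens and are billed at the
    cached rate instead of the regular input rate.
    """
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    uncached = input_tokens - cached_input_tokens
    total = (
        uncached * INPUT_PRICE
        + cached_input_tokens * CACHED_INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
    )
    return total / 1_000_000


if __name__ == "__main__":
    # 200k input tokens (150k of them cached) and 8k output tokens.
    print(f"${estimate_cost(200_000, 8_000, cached_input_tokens=150_000):.4f}")
```

For that example the estimate is about $0.089: $0.03 for uncached input, $0.03 for cached input, and roughly $0.029 for output.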