
NVIDIA Nemotron 3 Super

Hybrid MoE model optimized for multi-agent workflows on a single GPU.

About the model

NVIDIA Nemotron 3 Super is a hybrid MoE model with a Mamba-transformer architecture, designed for high compute efficiency and accuracy in multi-agent applications. With 120B total parameters (12B activated per token), the model is optimized for running many collaborating agents per application on a single GPU, delivering strong accuracy for reasoning, tool calling, and instruction following. The hybrid Mamba-transformer architecture yields significantly higher token generation throughput, enabling faster reasoning and higher accuracy within the same time budget. Fully open source, with open weights, data, and recipes, Nemotron 3 Super achieves leading accuracy across the GPQA Diamond, AIME 2025, LiveCodeBench, IFBench, and BFCL benchmarks on Together AI's production infrastructure.
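
For reference, here is a minimal sketch of querying the model through Together AI's OpenAI-compatible chat completions API using the `together` Python SDK. The model slug `nvidia/nemotron-3-super` is an illustrative placeholder; check the Together model library for the exact identifier.

```python
# Minimal sketch: querying Nemotron 3 Super on Together AI via the
# OpenAI-compatible chat completions API.
# Requires: pip install together, with TOGETHER_API_KEY set in the environment.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

response = client.chat.completions.create(
    # Placeholder slug for illustration; verify the exact model ID in the
    # Together model library before use.
    model="nvidia/nemotron-3-super",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain what a hybrid Mamba-transformer MoE is."},
    ],
    max_tokens=512,  # also serves as a crude cap on generation cost
)

print(response.choices[0].message.content)
```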

Context Length: 1M (extended context for long-horizon workflows)

Faster Token Generation: 50% (vs. the best open model today, per Artificial Analysis)

Active Parameters: 12B (from a 120B-parameter MoE architecture)

Model key capabilities
  • High Efficiency: Mamba-transformer MoE architecture with 50% higher token generation throughput than the best open model today (per Artificial Analysis)
  • Multi-Agent Optimization: Combines Latent MoE for cost-efficient multi-expert inference, multi-environment RL training for leading accuracy, and a 1M token context length
  • Fully Open-Source: Open weights (NVIDIA license), open data (synthetic from frontier models), open recipes for full transparency
  • Production-Ready Infrastructure: 99.9% SLA, 1M context, available on Together AI serverless and dedicated infrastructure
  • Model card

    Architecture Overview:
    • Hybrid Mixture of Experts (MoE) with Mamba-transformer architecture
    • 120B total parameters with 12B activated per forward pass via sparse MoE routing (see the routing sketch after this model card)
    • Optimized for running many collaborating agents per application on single GPU
    • Hybrid Mamba-transformer delivers significantly higher token generation throughput vs pure transformer
    • Thinking budget optimization avoids overthinking and ensures predictable inference costs
    • 1M token context length for processing extensive codebases and long-horizon workflows
    • Single GPU deployment: 1×B200, 1×GB200, 2×H100, 1×H200, 4×A100, 4×L40S, 1×DGX Spark, 1×RTX 6000

    Training Methodology:
    • Trained with NVIDIA-curated high-quality synthetic data from expert reasoning models
    • Reinforcement learning alignment for human-like reasoning across diverse task categories
    • Open data: fully transparent synthetic dataset generated using frontier open reasoning models
    • Open recipes: NVIDIA development techniques and tools for customization and optimization
    • Post-training optimizations for powerful, transparent, and adaptable deployment

    Performance Characteristics:
    • Leading accuracy across GPQA Diamond, AIME 2025, LiveCodeBench, IFBench, BFCL benchmarks
    • Highest compute efficiency via hybrid Mamba-transformer architecture
    • MoE architecture reduces compute and meets stringent latency requirements
    • Thinking budget optimizes for lower, predictable inference cost
    • Multi-agent optimization: high accuracy for reasoning, tool calling, instruction following
    • Significantly higher token generation throughput enabling faster thinking
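
To make the sparse activation concrete, below is a minimal, generic top-k MoE routing sketch in Python/NumPy. This is not Nemotron's actual routing code; the expert count, top-k value, and dimensions are illustrative, chosen so that top-1 routing over 10 experts mirrors the roughly 10% activation ratio (12B of 120B parameters) described above.

```python
# Minimal, generic top-k MoE routing sketch (illustrative only; NOT
# Nemotron 3 Super's actual implementation). Shows how sparse routing
# activates only a fraction of total parameters per token.
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 64, 10, 1  # illustrative sizes

# Each "expert" is a small feed-forward block; only top_k of them run per token.
experts = [
    (rng.standard_normal((d_model, 4 * d_model)) * 0.02,
     rng.standard_normal((4 * d_model, d_model)) * 0.02)
    for _ in range(n_experts)
]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02


def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top_k experts and mix their outputs."""
    logits = x @ router_w                      # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)      # softmax gate over experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]    # indices of selected experts
        gate = probs[t, top] / probs[t, top].sum()  # renormalized gate weights
        for g, e in zip(gate, top):
            w1, w2 = experts[e]
            out[t] += g * (np.maximum(x[t] @ w1, 0.0) @ w2)  # ReLU FFN expert
    return out


tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)  # (4, 64): only top_k experts ran per token
```

Per token, only the selected experts' feed-forward weights participate in the computation, which is why activated parameters, and therefore per-token compute, stay far below the total parameter count.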

  • Applications & use cases

    Multi-Agent Software Development:
    • Optimized for running many collaborating agents per application on single GPU
    • Code summarization, generation, refactoring across multiple agent workflows
    • Leading accuracy on LiveCodeBench for competitive programming tasks
    • High accuracy for tool calling and instruction following in complex coding workflows
    • 1M context supporting entire codebases and long-horizon development tasks

    Financial Services Automation:
    • Accelerate loan processing by extracting data and analyzing income patterns
    • Detect fraudulent operations, reducing cycle times and risk
    • Multi-agent workflows for comprehensive financial analysis
    • Thinking budget ensures predictable costs for high-volume operations

    Cybersecurity Operations:
    • Automatically triage vulnerabilities with multi-agent coordination
    • Perform in-depth malware analysis across security tools
    • Proactively hunt for security threats with agentic workflows
    • High accuracy for instruction following in security-critical operations

    Search & Productivity Agents:
    • Leading accuracy on IFBench and BFCL for instruction following and function calling (see the tool-calling sketch after these use cases)
    • Multi-agent search workflows to increase productivity
    • Thinking budget optimization for cost-effective at-scale deployment
    • 1M context for processing extensive research materials

    Retail Optimization:
    • Optimize inventory management with multi-agent coordination
    • Real-time personalized product recommendations and support
    • Enhance in-store service with collaborative agent systems
    • Predictable inference costs via thinking budget optimization

    Open-Source Customization:
    • Open weights: NVIDIA license for enterprise flexibility and data control
    • Open data: fully transparent NVIDIA-generated synthetic training data
    • Open recipes: development techniques for building custom reasoning models
    • Deploy anywhere: from laptop to cloud via NVIDIA NIM
    • Full transparency and adaptability for researchers and enterprises
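
To make the function-calling capability concrete, here is a minimal sketch of OpenAI-style tool calling through Together AI's chat completions API. The model slug and the `get_weather` tool are hypothetical placeholders for illustration; verify model-specific tool support in the Together documentation.

```python
# Minimal tool-calling sketch via Together AI's OpenAI-compatible API.
# The model slug and the get_weather tool are illustrative placeholders.
import json
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="nvidia/nemotron-3-super",  # placeholder slug; verify before use
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# If the model chose to call a tool, inspect the structured call it emitted.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(message.content)
```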

Model details
  • Model Provider: NVIDIA
  • Type: Code, Chat
  • Main use cases: Chat, Reasoning, Coding Agents
  • Deployment: On-Demand Dedicated, Monthly Reserved
  • Parameters: 120B
  • Context Length: 1M
  • Input modalities: Text
  • Output modalities: Text
  • Category: Chat