NVIDIA Nemotron 3 Super
Hybrid MoE model optimized for multi-agent workflows on a single GPU.
About model
NVIDIA Nemotron 3 Super is a hybrid Mamba-transformer MoE model designed for high compute efficiency and accuracy in multi-agent applications. With 120B total parameters (12B activated per forward pass), it is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool calling, and instruction following. The hybrid Mamba-transformer architecture delivers significantly higher token-generation throughput than pure transformers, enabling faster thinking and higher accuracy within the same time budget. Fully open source, with open weights, data, and recipes, Nemotron 3 Super achieves leading accuracy on the GPQA Diamond, AIME 2025, LiveCodeBench, IFBench, and BFCL benchmarks, and is served on Together AI's production infrastructure.
- 1M: extended context for long-horizon workflows
- 50%: higher token generation vs best open model
- 12B: activated parameters from 120B total MoE architecture
- High Efficiency: Mamba-transformer MoE architecture with 50% higher token generation than the best open model today (per Artificial Analysis)
- Multi-Agent Optimization: Combines Latent MoE for cost-efficient multi-expert inference, multi-environment RL training for leading accuracy, and a 1M token context length
- Fully Open-Source: Open weights (NVIDIA license), open data (synthetic from frontier models), open recipes for full transparency
- Production-Ready Infrastructure: 99.9% SLA, 1M context, available on Together AI serverless and dedicated infrastructure
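Models on Together AI's serverless infrastructure are typically queried through an OpenAI-compatible chat-completions endpoint. The sketch below assembles such a request payload; the model ID string is a placeholder assumption, not a confirmed identifier, so check the Together model catalog for the real one.

```python
import json

API_URL = "https://api.together.xyz/v1/chat/completions"  # Together's OpenAI-compatible endpoint
MODEL_ID = "nvidia/nemotron-3-super"  # placeholder ID -- confirm against the Together catalog

def build_chat_request(user_message: str, max_tokens: int = 1024) -> dict:
    """Assemble a chat-completions payload for a serverless deployment."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": "You are a concise coding assistant."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.6,
    }

payload = build_chat_request("Summarize this repository's build steps.")
print(json.dumps(payload, indent=2))
```

The same payload shape works for dedicated endpoints; only the base URL and API key change.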
Model card
Architecture Overview:
• Hybrid Mixture of Experts (MoE) with Mamba-transformer architecture
• 120B total parameters with 12B activated per forward pass via sparse MoE routing
• Optimized for running many collaborating agents per application on single GPU
• Hybrid Mamba-transformer delivers significantly higher token generation throughput vs pure transformer
• Thinking budget optimization avoids overthinking and ensures predictable inference costs
• 1M token context length for processing extensive codebases and long-horizon workflows
• Single GPU deployment: 1×B200, 1×GB200, 2×H100, 1×H200, 4×A100, 4×L40S, 1×DGX Spark, 1×RTX 6000
Training Methodology:
• Trained with NVIDIA-curated high-quality synthetic data from expert reasoning models
• Multi-environment reinforcement learning alignment for human-like reasoning across diverse task categories
• Open data: fully transparent synthetic dataset generated using frontier open reasoning models
• Open recipes: NVIDIA development techniques and tools for customization and optimization
• Post-training optimizations for powerful, transparent, and adaptable deployment
Performance Characteristics:
• Leading accuracy across GPQA Diamond, AIME 2025, LiveCodeBench, IFBench, BFCL benchmarks
• Highest compute efficiency via hybrid Mamba-transformer architecture
• MoE architecture reduces compute and meets stringent latency requirements
• Thinking budget optimizes for lower, predictable inference cost
• Multi-agent optimization: high accuracy for reasoning, tool calling, instruction following
• Significantly higher token generation throughput enabling faster thinking
Applications & use cases
Multi-Agent Software Development:
• Optimized for running many collaborating agents per application on single GPU
• Code summarization, generation, refactoring across multiple agent workflows
• Leading accuracy on LiveCodeBench for competitive programming tasks
• High accuracy for tool calling and instruction following in complex coding workflows
• 1M context supporting entire codebases and long-horizon development tasks
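Tool calling in coding workflows generally follows the OpenAI-style `tools` schema used by OpenAI-compatible chat APIs. Below is a minimal sketch: the `run_tests` tool, its parameters, and the placeholder model ID are all hypothetical names for illustration.

```python
import json

# Hypothetical tool a coding agent might register; the schema follows the
# OpenAI-style "tools" format accepted by OpenAI-compatible chat APIs.
run_tests_tool = {
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool name
        "description": "Run the project's test suite and return pass/fail counts.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory containing the tests."},
                "verbose": {"type": "boolean", "description": "Include per-test output."},
            },
            "required": ["path"],
        },
    },
}

def tool_call_request(user_message: str) -> dict:
    """Payload letting the model decide whether to call the tool."""
    return {
        "model": "nvidia/nemotron-3-super",  # placeholder model ID
        "messages": [{"role": "user", "content": user_message}],
        "tools": [run_tests_tool],
        "tool_choice": "auto",
    }

req = tool_call_request("The refactor is done; verify nothing broke.")
print(json.dumps(req["tools"][0]["function"]["name"]))
```

With `tool_choice` set to `"auto"`, the model returns either a normal message or a structured tool call that the agent loop executes and feeds back.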
Financial Services Automation:
• Accelerate loan processing by extracting data and analyzing income patterns
• Detect fraudulent operations, reducing cycle times and risk
• Multi-agent workflows for comprehensive financial analysis
• Thinking budget ensures predictable costs for high-volume operations
Cybersecurity Operations:
• Automatically triage vulnerabilities with multi-agent coordination
• Perform in-depth malware analysis across security tools
• Proactively hunt for security threats with agentic workflows
• High accuracy for instruction following in security-critical operations
Search & Productivity Agents:
• Leading accuracy on IFBench and BFCL for instruction following and function calling
• Multi-agent search workflows to increase productivity
• Thinking budget optimization for cost-effective at-scale deployment
• 1M context for processing extensive research materials
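A thinking budget can be enforced client-side with a two-phase scheme: stop reasoning at a token cap, close the thinking block, then request the final answer. The sketch below assumes reasoning is emitted inside `<think>...</think>` tags (a convention NVIDIA documents for its reasoning models); `generate` is a stand-in for any completion call, and the stub treats characters as tokens purely for illustration.

```python
def budgeted_answer(generate, prompt: str, thinking_budget: int) -> str:
    """Two-phase generation that caps reasoning at `thinking_budget` tokens.

    `generate(prompt, max_tokens, stop)` is a stand-in for any completion
    API call returning generated text. Assumes the model emits its chain
    of thought inside <think>...</think> before the final answer.
    """
    # Phase 1: let the model think, but stop at the budget or at </think>.
    thoughts = generate(prompt + "<think>", max_tokens=thinking_budget, stop=["</think>"])
    # Phase 2: close the thinking block ourselves and ask for the answer,
    # so cost stays predictable regardless of how long the model "wants" to think.
    full_prompt = prompt + "<think>" + thoughts + "</think>"
    return generate(full_prompt, max_tokens=512, stop=None)

# Stub standing in for a real endpoint (counts characters as "tokens").
def fake_generate(prompt, max_tokens, stop):
    if prompt.endswith("<think>"):
        return "step 1; step 2"[:max_tokens]
    return "final answer"

result = budgeted_answer(fake_generate, "Q: 2+2?", thinking_budget=64)
print(result)  # -> final answer
```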
Retail Optimization:
• Optimize inventory management with multi-agent coordination
• Real-time personalized product recommendations and support
• Enhance in-store service with collaborative agent systems
• Predictable inference costs via thinking budget optimization
Open-Source Customization:
• Open weights: NVIDIA license for enterprise flexibility and data control
• Open data: fully transparent NVIDIA-generated synthetic training data
• Open recipes: development techniques for building custom reasoning models
• Deploy anywhere, from laptop to cloud, via NVIDIA NIM
• Full transparency and adaptability for researchers and enterprises
- Model Provider: NVIDIA
- Type: Code, Chat
- Main use cases: Chat, Reasoning, Coding Agents
- Deployment: On-Demand, Dedicated, Monthly Reserved
- Parameters: 120B
- Context Length: 1M
- Input modalities: Text
- Output modalities: Text
- Category: Chat
