
NVIDIA Nemotron 3 Super

Hybrid MoE model optimized for multi-agent workflows on a single GPU.

About the model

NVIDIA Nemotron 3 Super is a hybrid MoE model with a Mamba-transformer architecture, designed for high compute efficiency and accuracy in multi-agent applications. With 120B total parameters (12B activated per token), the model is optimized for running many collaborating agents per application on a single GPU, delivering strong accuracy for reasoning, tool calling, and instruction following. The hybrid Mamba-transformer architecture yields significantly higher token generation throughput, enabling faster reasoning and higher accuracy within the same time budget. Fully open source, with open weights, data, and recipes, Nemotron 3 Super achieves leading accuracy across the GPQA Diamond, AIME 2025, LiveCodeBench, IFBench, and BFCL benchmarks on Together AI's production infrastructure.
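
For reference, here is a minimal sketch of querying the model through Together AI's OpenAI-compatible chat completions API using the `together` Python SDK. The model slug `nvidia/nemotron-3-super` is an illustrative placeholder; check the Together model library for the exact identifier.

```python
# Minimal sketch: querying Nemotron 3 Super on Together AI via the
# OpenAI-compatible chat completions API.
# Requires: pip install together, with TOGETHER_API_KEY set in the environment.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

response = client.chat.completions.create(
    # Placeholder slug for illustration; verify the exact model ID in the
    # Together model library before use.
    model="nvidia/nemotron-3-super",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain what a hybrid Mamba-transformer MoE is."},
    ],
    max_tokens=512,  # also serves as a crude cap on generation cost
)

print(response.choices[0].message.content)
```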

Context Length: 1M (extended context for long-horizon workflows)

Faster Token Generation: 50% (vs. the best open model today, per Artificial Analysis)

Active Parameters: 12B (from a 120B-parameter MoE architecture)

Model key capabilities
  • High Efficiency: Mamba-transformer MoE architecture with 50% higher token generation throughput than the best open model today (per Artificial Analysis)
  • Multi-Agent Optimization: Combines Latent MoE for cost-efficient multi-expert inference, multi-environment RL training for leading accuracy, and a 1M token context length
  • Fully Open-Source: Open weights (NVIDIA license), open data (synthetic from frontier models), open recipes for full transparency
  • Production-Ready Infrastructure: 99.9% SLA, 1M context, available on Together AI serverless and dedicated infrastructure
  • Model card

    Architecture Overview:
    • Hybrid Mixture of Experts (MoE) with Mamba-transformer architecture
    • 120B total parameters with 12B activated per forward pass via sparse MoE routing (see the routing sketch after this model card)
    • Optimized for running many collaborating agents per application on single GPU
    • Hybrid Mamba-transformer delivers significantly higher token generation throughput vs pure transformer
    • Thinking budget optimization avoids overthinking and ensures predictable inference costs
    • 1M token context length for processing extensive codebases and long-horizon workflows
    • Single GPU deployment: 1×B200, 1×GB200, 2×H100, 1×H200, 4×A100, 4×L40S, 1×DGX Spark, 1×RTX 6000

    Training Methodology:
    • Trained with NVIDIA-curated high-quality synthetic data from expert reasoning models
    • Reinforcement learning alignment for human-like reasoning across diverse task categories
    • Open data: fully transparent synthetic dataset generated using frontier open reasoning models
    • Open recipes: NVIDIA development techniques and tools for customization and optimization
    • Post-training optimizations for powerful, transparent, and adaptable deployment

    Performance Characteristics:
    • Leading accuracy across GPQA Diamond, AIME 2025, LiveCodeBench, IFBench, BFCL benchmarks
    • Highest compute efficiency via hybrid Mamba-transformer architecture
    • MoE architecture reduces compute and meets stringent latency requirements
    • Thinking budget optimizes for lower, predictable inference cost
    • Multi-agent optimization: high accuracy for reasoning, tool calling, instruction following
    • Significantly higher token generation throughput enabling faster thinking
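
To make the sparse activation concrete, below is a minimal, generic top-k MoE routing sketch in Python/NumPy. This is not Nemotron's actual routing code; the expert count, top-k value, and dimensions are illustrative, chosen so that top-1 routing over 10 experts mirrors the roughly 10% activation ratio (12B of 120B parameters) described above.

```python
# Minimal, generic top-k MoE routing sketch (illustrative only; NOT
# Nemotron 3 Super's actual implementation). Shows how sparse routing
# activates only a fraction of total parameters per token.
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 64, 10, 1  # illustrative sizes

# Each "expert" is a small feed-forward block; only top_k of them run per token.
experts = [
    (rng.standard_normal((d_model, 4 * d_model)) * 0.02,
     rng.standard_normal((4 * d_model, d_model)) * 0.02)
    for _ in range(n_experts)
]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02


def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top_k experts and mix their outputs."""
    logits = x @ router_w                      # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)      # softmax gate over experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]    # indices of selected experts
        gate = probs[t, top] / probs[t, top].sum()  # renormalized gate weights
        for g, e in zip(gate, top):
            w1, w2 = experts[e]
            out[t] += g * (np.maximum(x[t] @ w1, 0.0) @ w2)  # ReLU FFN expert
    return out


tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)  # (4, 64): only top_k experts ran per token
```

Per token, only the selected experts' feed-forward weights participate in the computation, which is why activated parameters, and therefore per-token compute, stay far below the total parameter count.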

  • Applications & use cases

    Multi-Agent Software Development:
    • Optimized for running many collaborating agents per application on single GPU
    • Code summarization, generation, refactoring across multiple agent workflows
    • Leading accuracy on LiveCodeBench for competitive programming tasks
    • High accuracy for tool calling and instruction following in complex coding workflows
    • 1M context supporting entire codebases and long-horizon development tasks

    Financial Services Automation:
    • Accelerate loan processing by extracting data and analyzing income patterns
    • Detect fraudulent operations, reducing cycle times and risk
    • Multi-agent workflows for comprehensive financial analysis
    • Thinking budget ensures predictable costs for high-volume operations

    Cybersecurity Operations:
    • Automatically triage vulnerabilities with multi-agent coordination
    • Perform in-depth malware analysis across security tools
    • Proactively hunt for security threats with agentic workflows
    • High accuracy for instruction following in security-critical operations

    Search & Productivity Agents:
    • Leading accuracy on IFBench and BFCL for instruction following and function calling (see the tool-calling sketch after these use cases)
    • Multi-agent search workflows to increase productivity
    • Thinking budget optimization for cost-effective at-scale deployment
    • 1M context for processing extensive research materials

    Retail Optimization:
    • Optimize inventory management with multi-agent coordination
    • Real-time personalized product recommendations and support
    • Enhance in-store service with collaborative agent systems
    • Predictable inference costs via thinking budget optimization

    Open-Source Customization:
    • Open weights: NVIDIA license for enterprise flexibility and data control
    • Open data: fully transparent NVIDIA-generated synthetic training data
    • Open recipes: development techniques for building custom reasoning models
    • Deploy anywhere: from laptop to cloud via NVIDIA NIM
    • Full transparency and adaptability for researchers and enterprises
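
To make the function-calling capability concrete, here is a minimal sketch of OpenAI-style tool calling through Together AI's chat completions API. The model slug and the `get_weather` tool are hypothetical placeholders for illustration; verify model-specific tool support in the Together documentation.

```python
# Minimal tool-calling sketch via Together AI's OpenAI-compatible API.
# The model slug and the get_weather tool are illustrative placeholders.
import json
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="nvidia/nemotron-3-super",  # placeholder slug; verify before use
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# If the model chose to call a tool, inspect the structured call it emitted.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(message.content)
```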

Model details
  • Model Provider: NVIDIA
  • Type: Code, Chat
  • Main use cases: Chat, Reasoning, Coding Agents
  • Deployment: On-Demand Dedicated, Monthly Reserved
  • Parameters: 120B
  • Context Length: 1M
  • Input modalities: Text
  • Output modalities: Text
  • Category: Chat