NVIDIA-Nemotron-Nano-9B-v2 API

This model is not currently supported on Together AI.
Visit our Models page to view all the latest models.
Unified Reasoning & Chat Model: NVIDIA-Nemotron-Nano-9B-v2 is a cutting-edge large language model designed as a unified solution for both reasoning and non-reasoning tasks. Built with a hybrid Mamba2-Transformer architecture, it delivers exceptional performance on complex reasoning benchmarks while maintaining efficiency for everyday conversational AI applications.
Controllable Intelligence: The model features unique runtime reasoning budget control, allowing developers to balance accuracy and response time based on their specific use case. Whether you need deep analytical thinking or quick responses, Nemotron Nano 2 adapts to your requirements.
Multilingual & Production-Ready: Supporting English, German, Spanish, French, Italian, and Japanese, this model is ready for commercial deployment with comprehensive API integration options via NVIDIA's platform and Hugging Face.
NVIDIA-Nemotron-Nano-9B-v2 API Usage
How to use NVIDIA-Nemotron-Nano-9B-v2
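Since the model is not hosted on Together AI, below is a provider-agnostic sketch of an OpenAI-compatible chat-completions request. The base URL, API key, model identifier string, and sampling defaults are placeholders — substitute the values for whichever provider hosts the model.

```python
# Sketch: assembling a chat-completions request body for
# NVIDIA-Nemotron-Nano-9B-v2 on any OpenAI-compatible endpoint.
# The model name and defaults here are assumptions, not official values.
import json

def build_chat_request(messages,
                       model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
                       max_tokens=512,
                       temperature=0.6):
    """Assemble the JSON body for a POST to /v1/chat/completions."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_chat_request([
    {"role": "system", "content": "/think"},      # reasoning-on mode
    {"role": "user", "content": "What is 17 * 24?"},
])
print(json.dumps(payload, indent=2))
```

The body would then be POSTed with any HTTP client (or passed through an OpenAI-style SDK) along with the hosting provider's authentication header.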
Model details
Architecture Overview:
• Hybrid Architecture: Primarily Mamba-2 and MLP layers combined with just four Attention layers (Nemotron-H design)
• Context Window: Supports up to 128K tokens for extended document processing
• Model Size: 9 billion parameters trained from scratch by NVIDIA
• Training Period: June 2025 - August 2025 with data cutoff of September 2024
• Training Infrastructure: Built using Megatron-LM and NeMo-RL frameworks
Training Methodology:
• Pretraining Corpus: Over 20 trillion tokens of high-quality curated and synthetically-generated data
• Data Sources: English Common Crawl (3.36T tokens), Multilingual Common Crawl (812.7B tokens), GitHub Crawl (747.4B tokens)
• Synthetic Data: Leverages reasoning traces from DeepSeek R1, Qwen3-235B-A22B, Nemotron 4 340B, and other state-of-the-art models
• Domain Coverage: Code (43 programming languages), legal, math, science, finance, multilingual text (15 languages)
• Post-Training: Specialized reasoning-focused instruction tuning with synthetic reasoning traces
Performance Characteristics:
• Reasoning Benchmarks: 72.1% AIME25, 97.8% MATH500, 64.0% GPQA, 71.1% LiveCodeBench
• Instruction Following: 90.3% IFEval (Instruction Strict), 66.9% BFCL v3 for function calling
• Long Context: 78.9% RULER at 128K context length
• Reasoning Modes: Supports both "reasoning-on" (emitting a tagged reasoning trace) and "reasoning-off" modes, toggled via the system prompt
• Runtime Control: Unique thinking budget control allows specification of maximum reasoning tokens
• Multilingual: Supports English, German, Spanish, French, Italian, and Japanese, with quality improvements derived from Qwen data
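The thinking-budget control mentioned above can be sketched as a simple token-stream policy: let the model reason until it closes its trace on its own, or force the trace closed once the budget is spent so the model moves on to its final answer. The tag name and cut-off logic here are assumptions for illustration; the hosted implementation applies the budget server-side.

```python
# Minimal sketch of runtime thinking-budget control. `tokens` stands in
# for a streamed sequence of generated tokens; once `budget` reasoning
# tokens have been emitted, the trace is closed with </think> so the
# model proceeds to its answer.
def apply_thinking_budget(tokens, budget):
    """Truncate a streamed reasoning trace at `budget` tokens."""
    out = []
    for i, tok in enumerate(tokens):
        if tok == "</think>":        # model finished reasoning on its own
            out.append(tok)
            break
        if i >= budget:              # budget exhausted: force-close the trace
            out.append("</think>")
            break
        out.append(tok)
    return out

# Under budget: the trace passes through unchanged.
print(apply_thinking_budget(["a", "b", "</think>"], budget=10))
# Over budget: reasoning is cut off after two tokens.
print(apply_thinking_budget(["a", "b", "c", "d"], budget=2))
```

This is what makes the accuracy/latency trade-off tunable per request: a small budget favors response time, a large one favors deeper analysis.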
Prompting NVIDIA-Nemotron-Nano-9B-v2
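Per the model card, reasoning mode is controlled through the system prompt ("/think" to enable, "/no_think" to disable), and reasoning-on completions wrap the trace in tags before the final answer. The `<think>` tag name is an assumption based on the published card; a small post-processing helper can strip the trace when only the answer is needed:

```python
# Sketch: stripping the reasoning trace from a reasoning-on completion.
# Assumes the trace is delimited by <think>...</think>, per the model card.
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(completion: str) -> str:
    """Remove the <think>...</think> reasoning trace, keeping the answer."""
    return THINK_RE.sub("", completion).strip()

raw = "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68.</think>\nThe answer is 408."
print(strip_reasoning(raw))  # -> "The answer is 408."
```

For latency-sensitive calls, setting the system prompt to "/no_think" avoids generating the trace in the first place rather than discarding it afterwards.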
Applications & Use Cases
Primary Use Cases:
• Mathematical Reasoning: Exceptional performance on AIME, MATH500, and competition-level problems
• Scientific Analysis: Strong GPQA scores for graduate-level science questions
• Code Generation: 71.1% LiveCodeBench with support for 43 programming languages
• AI Agent Systems: Controllable reasoning makes it ideal for multi-step agent workflows
• Customer Support: Reasoning budget control enables balance between accuracy and response time
• Function Calling: Native tool-calling support, validated on the BFCL v3 benchmark
Enterprise Applications:
• RAG Systems: 128K context window supports extensive document retrieval and analysis
• Chatbots: Multilingual support (6 languages) for global customer engagement
• Content Moderation: Trained with Nemotron Content Safety Dataset V2 for safe outputs
• Educational Tools: Mathematical and scientific reasoning capabilities for tutoring applications
• Research Assistance: Long-context support for analyzing papers, reports, and technical documents
Edge & Latency-Sensitive Deployments:
• Efficient Architecture: Mamba2-Transformer hybrid enables faster inference than pure attention models
• Variable Compute: Thinking budget control optimizes for time-critical applications
• Streaming Support: Compatible with vLLM streaming for real-time response generation
• Hardware Optimization: Optimized for NVIDIA GPUs (A10G, A100, H100) with TensorRT-LLM support
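The streaming support noted above follows the standard OpenAI-compatible delta format that vLLM emits. The sketch below shows only the client-side chunk-handling logic; `fake_stream` mimics the delta objects a live streamed chat completion would yield, so no server is required to follow it.

```python
# Sketch: consuming an OpenAI-compatible / vLLM streaming response.
# Each chunk carries a partial "delta"; concatenating the content
# deltas reconstructs the full completion text.
def collect_stream(chunks):
    """Concatenate the content deltas from a streamed completion."""
    text = []
    for chunk in chunks:
        delta = chunk.get("choices", [{}])[0].get("delta", {})
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

# Stand-in for a live stream: first chunk sets the role, later
# chunks carry content fragments.
fake_stream = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", world."}}]},
]
print(collect_stream(fake_stream))  # -> "Hello, world."
```

In a real deployment the same loop would iterate over the SDK's streaming iterator, printing each delta as it arrives for low perceived latency.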