
NVIDIA-Nemotron-Nano-9B-v2

Advanced reasoning model with controllable thinking budget

About model

Unified Reasoning & Chat Model: NVIDIA-Nemotron-Nano-9B-v2 is a cutting-edge large language model designed as a unified solution for both reasoning and non-reasoning tasks. Built with a hybrid Mamba2-Transformer architecture, it delivers exceptional performance on complex reasoning benchmarks while maintaining efficiency for everyday conversational AI applications.

Controllable Intelligence: The model features unique runtime reasoning budget control, allowing developers to balance accuracy and response time based on their specific use case. Whether you need deep analytical thinking or quick responses, Nemotron Nano 2 adapts to your requirements.

Multilingual & Production-Ready: Supporting English, German, Spanish, French, Italian, and Japanese, this model is ready for commercial deployment with comprehensive API integration options via NVIDIA's platform and Hugging Face.

Performance benchmarks

[Interactive benchmark chart: the page compares NVIDIA-Nemotron-Nano-9B-v2 against related open-source models and competitor closed-source models (Claude Opus 4.6, OpenAI o3, OpenAI o1, GPT-4o) on AIME 2025, GPQA Diamond, HLE, LiveCodeBench, MATH500, and SWE-bench Verified. The individual scores did not survive extraction; the model's own published numbers appear under Performance Characteristics in the model card below.]

  • Model card

    Architecture Overview:
    • Hybrid Architecture: Primarily Mamba-2 and MLP layers combined with just four Attention layers (Nemotron-H design)
    • Context Window: Supports up to 128K tokens for extended document processing
    • Model Size: 9 billion parameters trained from scratch by NVIDIA
    • Training Period: June 2025 - August 2025 with data cutoff of September 2024
    • Training Infrastructure: Built using Megatron-LM and NeMo-RL frameworks

    Training Methodology:
    • Pretraining Corpus: Over 20 trillion tokens of high-quality curated and synthetically-generated data
    • Data Sources: English Common Crawl (3.36T tokens), Multilingual Common Crawl (812.7B tokens), GitHub Crawl (747.4B tokens)
    • Synthetic Data: Leverages reasoning traces from DeepSeek R1, Qwen3-235B-A22B, Nemotron 4 340B, and other state-of-the-art models
    • Domain Coverage: Code (43 programming languages), legal, math, science, finance, multilingual text (15 languages)
    • Post-Training: Specialized reasoning-focused instruction tuning with synthetic reasoning traces

    Performance Characteristics:
    • Reasoning Benchmarks: 72.1% AIME25, 97.8% MATH500, 64.0% GPQA, 71.1% LiveCodeBench
    • Instruction Following: 90.3% IFEval (Instruction Strict), 66.9% BFCL v3 for function calling
    • Long Context: 78.9% RULER at 128K context length
    • Reasoning Modes: Supports both "reasoning-on" (emitting <think> … </think> traces) and "reasoning-off" modes via system prompts
    • Runtime Control: Unique thinking budget control allows specification of maximum reasoning tokens
    • Multilingual: Supports English, German, Spanish, French, Italian, and Japanese with quality improvements from Qwen integration
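The reasoning-mode toggle above is driven by the system prompt. The sketch below uses the "/think" and "/no_think" control strings from the published model card; verify them against the checkpoint you deploy, since control tokens can change between versions.

```python
# Toggle reasoning mode via the system turn. "/think" and "/no_think"
# follow the published model card; confirm for your deployed version.

def make_messages(user_prompt: str, reasoning: bool = True) -> list[dict]:
    """Build a message list with the reasoning mode set in the system turn."""
    system = "/think" if reasoning else "/no_think"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

msgs = make_messages("Summarize this contract clause.", reasoning=False)
```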

  • Applications & use cases

    Primary Use Cases:
    • Mathematical Reasoning: Exceptional performance on AIME, MATH500, and competition-level problems
    • Scientific Analysis: Strong GPQA scores for graduate-level science questions
    • Code Generation: 71.1% LiveCodeBench with support for 43 programming languages
    • AI Agent Systems: Controllable reasoning makes it ideal for multi-step agent workflows
    • Customer Support: Reasoning budget control enables balance between accuracy and response time
    • Function Calling: Native tool-calling support, validated at 66.9% on the BFCL v3 benchmark
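A tool-calling round trip for the function-calling use case above can be sketched as follows. The schema uses the common OpenAI-style "tools" format that OpenAI-compatible servers accept; whether a given deployment parses tool calls for this model depends on its chat template, so treat this as a sketch, and the `get_weather` tool is a made-up example.

```python
import json

# Minimal tool-calling skeleton: an OpenAI-style tool schema plus a
# local dispatcher for calls the model emits. "get_weather" is a
# hypothetical example tool, stubbed rather than hitting a real API.

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute a parsed tool call locally and return a JSON result string."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name == "get_weather":
        # Stubbed lookup; a real handler would query a weather service.
        return json.dumps({"city": args["city"], "temp_c": 21})
    raise ValueError(f"unknown tool: {name}")

# A tool call as it would appear in an assistant response:
call = {"function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'}}
result = dispatch(call)  # feed back to the model as a "tool" role message
```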

    Enterprise Applications:
    • RAG Systems: 128K context window supports extensive document retrieval and analysis
    • Chatbots: Multilingual support (6 languages) for global customer engagement
    • Content Moderation: Trained with Nemotron Content Safety Dataset V2 for safe outputs
    • Educational Tools: Mathematical and scientific reasoning capabilities for tutoring applications
    • Research Assistance: Long-context support for analyzing papers, reports, and technical documents
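For the RAG use case above, the 128K window still has to be budgeted: retrieved chunks must fit alongside the prompt and leave room for the answer. A minimal packing sketch, using a crude 4-characters-per-token heuristic (an assumption — use the model's actual tokenizer for production budgeting):

```python
# Greedy context packing for RAG: keep retrieved chunks in order until
# the 128K-token window, minus headroom for the answer, is exhausted.
# The chars/4 token estimate is a rough heuristic, not the tokenizer.

CONTEXT_TOKENS = 128_000
RESERVED_FOR_OUTPUT = 4_000

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def pack_chunks(chunks: list[str],
                budget: int = CONTEXT_TOKENS - RESERVED_FOR_OUTPUT) -> list[str]:
    """Keep chunks in retrieval order while they fit the token budget."""
    packed, used = [], 0
    for chunk in chunks:
        cost = approx_tokens(chunk)
        if used + cost > budget:
            break
        packed.append(chunk)
        used += cost
    return packed
```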

    Edge & Latency-Sensitive Deployments:
    • Efficient Architecture: Mamba2-Transformer hybrid enables faster inference than pure attention models
    • Variable Compute: Thinking budget control optimizes for time-critical applications
    • Streaming Support: Compatible with vLLM streaming for real-time response generation
    • Hardware Optimization: Optimized for NVIDIA GPUs (A10G, A100, H100) with TensorRT-LLM support

Model details
  • Model provider
    NVIDIA
  • Type
    Chat
  • Main use cases
    Chat
    Small & Fast
  • Deployment
    On-Demand Dedicated
  • Parameters
    9B
  • Context length
    128K
  • Input price

    $0.06 / 1M tokens

  • Output price

    $0.25 / 1M tokens

  • Input modalities
    Text
  • Output modalities
    Text
  • Released
    August 12, 2025
  • Last updated
    February 24, 2026
  • Category
    Chat
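The listed prices ($0.06 / 1M input tokens, $0.25 / 1M output tokens) make per-request cost easy to estimate:

```python
# Cost estimate from the listed on-demand rates:
# $0.06 per 1M input tokens, $0.25 per 1M output tokens.

INPUT_PER_M = 0.06
OUTPUT_PER_M = 0.25

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# e.g. a 100K-token RAG prompt producing a 2K-token answer:
cost = estimate_cost(100_000, 2_000)  # 0.0065 USD
```

At these rates even near-full-context requests stay well under a cent, which is consistent with the page's Small & Fast positioning.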