NVIDIA-Nemotron-Nano-9B-v2 API

Advanced reasoning model with controllable thinking budget

This model is not currently supported on Together AI.

Visit our Models page to view all the latest models.

Unified Reasoning & Chat Model: NVIDIA-Nemotron-Nano-9B-v2 is a cutting-edge large language model designed as a unified solution for both reasoning and non-reasoning tasks. Built with a hybrid Mamba2-Transformer architecture, it delivers exceptional performance on complex reasoning benchmarks while maintaining efficiency for everyday conversational AI applications.

Controllable Intelligence: The model features unique runtime reasoning budget control, allowing developers to balance accuracy and response time based on their specific use case. Whether you need deep analytical thinking or quick responses, Nemotron Nano 2 adapts to your requirements.

Multilingual & Production-Ready: Supporting English, German, Spanish, French, Italian, and Japanese, this model is ready for commercial deployment with comprehensive API integration options via NVIDIA's platform and Hugging Face.

NVIDIA-Nemotron-Nano-9B-v2 API Usage

Endpoint

curl -X POST "https://api.together.xyz/v1/chat/completions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "messages": [
      {
        "role": "user",
        "content": "What are some fun things to do in New York?"
      }
    ]
}'
curl -X POST "https://api.together.xyz/v1/images/generations" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "prompt": "Draw an anime style version of this image.",
    "width": 1024,
    "height": 768,
    "steps": 28,
    "n": 1,
    "response_format": "url",
    "image_url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"
  }'
curl -X POST https://api.together.xyz/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe what you see in this image."},
        {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"}}
      ]
    }],
    "max_tokens": 512
  }'
curl -X POST https://api.together.xyz/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "messages": [{
      "role": "user",
      "content": "Given two binary strings `a` and `b`, return their sum as a binary string"
    }]
  }'
curl -X POST https://api.together.xyz/v1/rerank \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "query": "What animals can I find near Peru?",
    "documents": [
      "The giant panda (Ailuropoda melanoleuca), also known as the panda bear or simply panda, is a bear species endemic to China.",
      "The llama is a domesticated South American camelid, widely used as a meat and pack animal by Andean cultures since the pre-Columbian era.",
      "The wild Bactrian camel (Camelus ferus) is an endangered species of camel endemic to Northwest China and southwestern Mongolia.",
      "The guanaco is a camelid native to South America, closely related to the llama. Guanacos are one of two wild South American camelids; the other species is the vicuña, which lives at higher elevations."
    ],
    "top_n": 2
  }'
curl -X POST https://api.together.xyz/v1/embeddings \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Our solar system orbits the Milky Way galaxy at about 515,000 mph.",
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
  }'
curl -X POST https://api.together.xyz/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    "prompt": "A horse is a horse",
    "max_tokens": 32,
    "temperature": 0.1,
    "safety_model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
  }'
curl --location 'https://api.together.ai/v1/audio/generations' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $TOGETHER_API_KEY" \
  --output speech.mp3 \
  --data '{
    "input": "Today is a wonderful day to build something people love!",
    "voice": "helpful woman",
    "response_format": "mp3",
    "sample_rate": 44100,
    "stream": false,
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
  }'
curl -X POST "https://api.together.xyz/v1/audio/transcriptions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -F "model=nvidia/NVIDIA-Nemotron-Nano-9B-v2" \
  -F "language=en" \
  -F "response_format=json" \
  -F "timestamp_granularities=segment"
from together import Together

client = Together()

response = client.chat.completions.create(
  model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
  messages=[
    {
      "role": "user",
      "content": "What are some fun things to do in New York?"
    }
  ]
)
print(response.choices[0].message.content)
from together import Together

client = Together()

imageCompletion = client.images.generate(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    width=1024,
    height=768,
    steps=28,
    prompt="Draw an anime style version of this image.",
    image_url="https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png",
)

print(imageCompletion.data[0].url)


from together import Together

client = Together()

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what you see in this image."},
            {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"}}
        ]
    }]
)
print(response.choices[0].message.content)

from together import Together

client = Together()
response = client.chat.completions.create(
  model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
  messages=[
    {
      "role": "user",
      "content": "Given two binary strings `a` and `b`, return their sum as a binary string"
    }
  ],
)

print(response.choices[0].message.content)

from together import Together

client = Together()

query = "What animals can I find near Peru?"

documents = [
  "The giant panda (Ailuropoda melanoleuca), also known as the panda bear or simply panda, is a bear species endemic to China.",
  "The llama is a domesticated South American camelid, widely used as a meat and pack animal by Andean cultures since the pre-Columbian era.",
  "The wild Bactrian camel (Camelus ferus) is an endangered species of camel endemic to Northwest China and southwestern Mongolia.",
  "The guanaco is a camelid native to South America, closely related to the llama. Guanacos are one of two wild South American camelids; the other species is the vicuña, which lives at higher elevations.",
]

response = client.rerank.create(
  model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
  query=query,
  documents=documents,
  top_n=2
)

for result in response.results:
    print(f"Relevance Score: {result.relevance_score}")

from together import Together

client = Together()

response = client.embeddings.create(
  model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
  input="Our solar system orbits the Milky Way galaxy at about 515,000 mph"
)

print(response.data[0].embedding[:8])

from together import Together

client = Together()

response = client.completions.create(
  model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
  prompt="A horse is a horse",
  max_tokens=32,
  temperature=0.1,
  safety_model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
)

print(response.choices[0].text)

from together import Together

client = Together()

speech_file_path = "speech.mp3"

response = client.audio.speech.create(
  model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
  input="Today is a wonderful day to build something people love!",
  voice="helpful woman",
)
    
response.stream_to_file(speech_file_path)

from together import Together

client = Together()
response = client.audio.transcribe(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    file="audio.mp3",  # path to a local audio file to transcribe
    language="en",
    response_format="json",
    timestamp_granularities="segment"
)
print(response.text)
import Together from 'together-ai';
const together = new Together();

const completion = await together.chat.completions.create({
  model: 'nvidia/NVIDIA-Nemotron-Nano-9B-v2',
  messages: [
    {
      role: 'user',
      content: 'What are some fun things to do in New York?'
     }
  ],
});

console.log(completion.choices[0].message.content);
import Together from "together-ai";

const together = new Together();

async function main() {
  const response = await together.images.create({
    model: "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    width: 1024,
    height: 1024,
    steps: 28,
    prompt: "Draw an anime style version of this image.",
    image_url: "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png",
  });

  console.log(response.data[0].url);
}

main();

import Together from "together-ai";

const together = new Together();
const imageUrl = "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png";

async function main() {
  const response = await together.chat.completions.create({
    model: "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages: [{
      role: "user",
      content: [
        { type: "text", text: "Describe what you see in this image." },
        { type: "image_url", image_url: { url: imageUrl } }
      ]
    }]
  });
  
  console.log(response.choices[0]?.message?.content);
}

main();

import Together from "together-ai";

const together = new Together();

async function main() {
  const response = await together.chat.completions.create({
    model: "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages: [{
      role: "user",
      content: "Given two binary strings `a` and `b`, return their sum as a binary string"
    }]
  });
  
  console.log(response.choices[0]?.message?.content);
}

main();

import Together from "together-ai";

const together = new Together();

const query = "What animals can I find near Peru?";
const documents = [
  "The giant panda (Ailuropoda melanoleuca), also known as the panda bear or simply panda, is a bear species endemic to China.",
  "The llama is a domesticated South American camelid, widely used as a meat and pack animal by Andean cultures since the pre-Columbian era.",
  "The wild Bactrian camel (Camelus ferus) is an endangered species of camel endemic to Northwest China and southwestern Mongolia.",
  "The guanaco is a camelid native to South America, closely related to the llama. Guanacos are one of two wild South American camelids; the other species is the vicuña, which lives at higher elevations."
];

async function main() {
  const response = await together.rerank.create({
    model: "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    query: query,
    documents: documents,
    top_n: 2
  });
  
  for (const result of response.results) {
    console.log(`Relevance Score: ${result.relevance_score}`);
  }
}

main();


import Together from "together-ai";

const together = new Together();

const response = await together.embeddings.create({
  model: 'nvidia/NVIDIA-Nemotron-Nano-9B-v2',
  input: 'Our solar system orbits the Milky Way galaxy at about 515,000 mph',
});

console.log(response.data[0].embedding);

import Together from "together-ai";

const together = new Together();

async function main() {
  const response = await together.completions.create({
    model: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    prompt: "A horse is a horse",
    max_tokens: 32,
    temperature: 0.1,
    safety_model: "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
  });
  
  console.log(response.choices[0]?.text);
}

main();

import { createWriteStream } from 'node:fs';
import { Readable } from 'node:stream';
import Together from 'together-ai';

const together = new Together();

async function generateAudio() {
   const res = await together.audio.create({
    input: 'Today is a wonderful day to build something people love!',
    voice: 'helpful woman',
    response_format: 'mp3',
    sample_rate: 44100,
    stream: false,
    model: 'nvidia/NVIDIA-Nemotron-Nano-9B-v2',
  });

  if (res.body) {
    console.log(res.body);
    const nodeStream = Readable.from(res.body as ReadableStream);
    const fileStream = createWriteStream('./speech.mp3');

    nodeStream.pipe(fileStream);
  }
}

generateAudio();

import Together from "together-ai";

const together = new Together();

const response = await together.audio.transcriptions.create({
  file: "audio.mp3", // path to a local audio file to transcribe
  model: "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
  language: "en",
  response_format: "json",
  timestamp_granularities: "segment"
});
console.log(response);

How to use NVIDIA-Nemotron-Nano-9B-v2

Model details

Architecture Overview:
• Hybrid Architecture: Primarily Mamba-2 and MLP layers combined with just four Attention layers (Nemotron-H design)
• Context Window: Supports up to 128K tokens for extended document processing
• Model Size: 9 billion parameters trained from scratch by NVIDIA
• Training Period: June 2025 - August 2025 with data cutoff of September 2024
• Training Infrastructure: Built using Megatron-LM and NeMo-RL frameworks

Training Methodology:
• Pretraining Corpus: Over 20 trillion tokens of high-quality curated and synthetically-generated data
• Data Sources: English Common Crawl (3.36T tokens), Multilingual Common Crawl (812.7B tokens), GitHub Crawl (747.4B tokens)
• Synthetic Data: Leverages reasoning traces from DeepSeek R1, Qwen3-235B-A22B, Nemotron-4 340B, and other state-of-the-art models
• Domain Coverage: Code (43 programming languages), legal, math, science, finance, multilingual text (15 languages)
• Post-Training: Specialized reasoning-focused instruction tuning with synthetic reasoning traces

Performance Characteristics:
• Reasoning Benchmarks: 72.1% AIME25, 97.8% MATH500, 64.0% GPQA, 71.1% LiveCodeBench
• Instruction Following: 90.3% IFEval (Instruction Strict), 66.9% BFCL v3 for function calling
• Long Context: 78.9% RULER at 128K context length
• Reasoning Modes: Supports both "reasoning-on" (with <think> tags) and "reasoning-off" modes via system prompts
• Runtime Control: Unique thinking budget control allows specification of maximum reasoning tokens
• Multilingual: Supports English, German, Spanish, French, Italian, and Japanese with quality improvements from Qwen integration

Prompting NVIDIA-Nemotron-Nano-9B-v2
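
Reasoning mode is selected per request through the system prompt. The snippet below is a minimal sketch using the Together Python SDK; it assumes the model follows NVIDIA's documented convention of "/think" to enable reasoning traces and "/no_think" to disable them, and that the hosted endpoint passes the system prompt through unchanged.

from together import Together

client = Together()

# Assumption: per NVIDIA's model card, a "/think" system prompt enables the
# reasoning trace and "/no_think" disables it for fast, direct replies.
response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages=[
        {"role": "system", "content": "/think"},
        {"role": "user", "content": "A train covers 120 km in 1.5 hours. What is its average speed in km/h?"}
    ],
    max_tokens=1024
)
print(response.choices[0].message.content)

With reasoning enabled, the output typically contains the model's <think> trace before the final answer; when the serving stack does not expose an explicit thinking budget, capping max_tokens is a coarse way to bound the combined reasoning-plus-answer length.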

Applications & Use Cases

Primary Use Cases:
• Mathematical Reasoning: Exceptional performance on AIME, MATH500, and competition-level problems
• Scientific Analysis: Strong GPQA scores for graduate-level science questions
• Code Generation: 71.1% LiveCodeBench with support for 43 programming languages
• AI Agent Systems: Controllable reasoning makes it ideal for multi-step agent workflows
• Customer Support: Reasoning budget control enables balance between accuracy and response time
• Function Calling: Native tool-calling support, validated on the BFCL v3 benchmark (see the sketch below)
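
Tool use goes through the OpenAI-style tools parameter on the chat completions endpoint. The sketch below defines a hypothetical get_weather function; whether this particular endpoint returns structured tool_calls for this model is an assumption based on its BFCL v3 results.

import json

from together import Together

client = Together()

# Hypothetical tool definition in the OpenAI-style function schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages=[{"role": "user", "content": "What is the weather in Lima right now?"}],
    tools=tools,
    tool_choice="auto"
)

# Print any tool calls the model decided to make.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))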

Enterprise Applications:
• RAG Systems: 128K context window supports extensive document retrieval and analysis (see the sketch after this list)
• Chatbots: Multilingual support (6 languages) for global customer engagement
• Content Moderation: Trained with Nemotron Content Safety Dataset V2 for safe outputs
• Educational Tools: Mathematical and scientific reasoning capabilities for tutoring applications
• Research Assistance: Long-context support for analyzing papers, reports, and technical documents
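
For the RAG use case above, retrieval happens outside the API call and the retrieved text is simply placed into the prompt, which the 128K context window leaves ample room for. A minimal sketch with placeholder chunks (a real system would pull these from a vector store or search index):

from together import Together

client = Together()

# Placeholder retrieval results standing in for a vector-store lookup.
retrieved_chunks = [
    "The llama is a domesticated South American camelid, widely used as a pack animal by Andean cultures.",
    "The guanaco is a camelid native to South America, closely related to the llama."
]
context = "\n\n".join(retrieved_chunks)

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: What animals can I find near Peru?"}
    ]
)
print(response.choices[0].message.content)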

Edge & Latency-Sensitive Deployments:
• Efficient Architecture: Mamba2-Transformer hybrid enables faster inference than pure attention models
• Variable Compute: Thinking budget control optimizes for time-critical applications
• Streaming Support: Compatible with vLLM streaming for real-time response generation (see the streaming sketch after this list)
• Hardware Optimization: Optimized for NVIDIA GPUs (A10G, A100, H100) with TensorRT-LLM support
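
Streaming, noted in the list above, follows the same pattern as other chat models on the platform. The sketch below uses the Together Python SDK's stream=True flag and assumes streamed chat completions are enabled for this endpoint.

from together import Together

client = Together()

# Request a streamed response and print tokens as they arrive.
stream = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages=[{"role": "user", "content": "Summarize the benefits of hybrid Mamba-Transformer models in two sentences."}],
    stream=True
)

for chunk in stream:
    # Each chunk carries an incremental delta of the assistant message.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()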

Looking for production scale? Deploy on a dedicated endpoint

Deploy NVIDIA-Nemotron-Nano-9B-v2 on a dedicated endpoint with custom hardware configuration, as many instances as you need, and auto-scaling.
