NVIDIA Nemotron 3 Nano Omni
An open omni-modal foundation model that unifies understanding and reasoning across video, audio, images, documents, charts, and text — purpose-built for agentic AI.
About the model
NVIDIA Nemotron™ 3 Nano Omni replaces fragmented, modality-specific pipelines with a single coherent system, enabling enterprises and developers to build agents that reason over real-world inputs and act with precision at production scale. Consolidating perception and reasoning into one model reduces failure modes and simplifies updates, customization, and feature evolution, enabling faster iteration cycles and more robust production systems.
- 256K: extended context length for better reasoning
- 9x: lower compute for video reasoning
- 3B: activated parameters from a 30B-total MoE architecture
- Unified Multimodal Understanding: Natively processes video, audio, images, documents, charts, and GUIs within up to 256K tokens of shared multimodal context
- Multi-Environment Training: Multi-environment RL training with NeMo RL improves instruction following and converges faster to correct answers, yielding 19% higher multimodal intelligence
- Fully Open-Source: Open weights (NVIDIA license), open data (synthetic data generated from frontier models), and open recipes for full transparency and control
- Production-Ready Infrastructure: 99.9% SLA, available on Together AI dedicated infrastructure
Model card
Architecture Overview:
• Mixture of Experts (MoE) with hybrid Mamba-transformer architecture
• 30B total parameters with 3B activated
• 3D Convolution (Conv3D) layers for efficient temporal-spatial handling
• Efficient video sampling (EVS) for processing longer videos in the same amount of time
• Designed for agentic production workflows
• 256K-token context improves coherence by preserving long conversation history, plan state, and cross-document context
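The activated-parameter figure above follows from how Mixture-of-Experts layers work: a router sends each token to only a few experts, so most weights sit idle on any given forward pass. The toy sketch below illustrates the idea; the expert count, top-k value, and dimensions are assumptions for illustration, not Nemotron's actual configuration.

```python
# Toy Mixture-of-Experts routing sketch: only a fraction of total
# parameters is active per token, which is how a 30B-parameter model
# can run with ~3B activated. All sizes here are illustrative
# assumptions, not Nemotron's real config.
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 10   # assumed for illustration
TOP_K = 1          # experts activated per token (assumed)
D = 64             # toy hidden size

# Each "expert" is a small feed-forward weight matrix.
experts = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((D, NUM_EXPERTS)) / np.sqrt(D)

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router                          # (tokens, experts)
    topk = np.argsort(logits, axis=-1)[:, -TOP_K:]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]                 # softmax over selected experts
        w = np.exp(sel - sel.max())
        w /= w.sum()
        for weight, e in zip(w, topk[t]):
            out[t] += weight * (x[t] @ experts[e])
    return out

tokens = rng.standard_normal((4, D))
y = moe_layer(tokens)

total_params = NUM_EXPERTS * D * D
active_params = TOP_K * D * D
print(f"active fraction: {active_params / total_params:.0%}")  # 10%, mirroring 3B/30B
```

With 1 of 10 equally sized experts active, the active fraction is 10%, matching the 3B-of-30B ratio the model card quotes.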
Training Methodology:
• Multi-environment RL training with NeMo RL and NeMo Gym
• Open source: post-training and optimization techniques
• Open data: NVIDIA generated synthetic datasets
• Open recipes: NVIDIA development techniques and tools for customization and optimization
Performance Characteristics:
• 9x higher throughput for video reasoning
• 19% higher multimodal intelligence using multi-environment RL training
• Improved coherence from 256K-token context length
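The video-throughput gain rests on sampling: a bounded frame budget keeps compute flat as clips grow longer. The uniform sampler below is a minimal sketch of that principle only; NVIDIA's actual efficient video sampling (EVS) method is not detailed on this page.

```python
# Minimal frame-subsampling sketch: a fixed frame budget bounds compute
# regardless of clip length. Uniform sampling is an assumption for
# illustration; the real EVS algorithm is not specified here.
def sample_frame_indices(num_frames: int, budget: int) -> list[int]:
    """Pick up to `budget` evenly spaced frame indices from a clip."""
    if num_frames <= budget:
        return list(range(num_frames))
    step = num_frames / budget
    return [int(i * step) for i in range(budget)]

# A 10x longer clip still costs the same number of sampled frames.
print(len(sample_frame_indices(300, 32)))    # 32
print(len(sample_frame_indices(3000, 32)))   # 32
```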
Applications & use cases
Customer Service Agent:
• Processes multimodal inputs such as call recordings, screen recordings of sessions, screenshots of errors, and knowledge-base documents — all in one reasoning loop
• Understands not just what the customer said, but what they experienced and what business rules allow

Inputs:
• Audio: Recorded customer calls
• Speech: ASR transcripts of conversations
• Video: Screen recordings of sessions
• Images: Screenshots of errors, invoices
• Docs: Knowledge base, policies, CRM
Financial Analyst Agent:
• Reasons across financial filings, charts, scanned reports, earnings call audio, and investor presentation videos.
• Ties together what executives say, how numbers are presented visually, and what underlying documents show, producing grounded insights rather than surface-level summaries.

Inputs:
• Docs: Financial reports, earnings transcripts
• Images: Charts, tables, PDFs, scanned reports
• Speech: Earnings calls, analyst Q&A
• Audio: Raw call recordings for tone/emphasis
• Video: Investor presentations, briefings
Computer Use Agent:
• Processes screen recordings to understand UI state over time, interprets instructions and system signals, and reads policy documents — all within one reasoning loop.
• Enables the agent to see the interface, understand intent, read constraints, and take the correct action.

Inputs:
• Video: Screen recordings, UI state changes
• Images: Screenshots, dashboards, forms
• Speech: Spoken user instructions
• Audio: System alerts, confirmation cues
• Docs: Task instructions, validation policies
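Feeding several of these input types into one reasoning loop can be sketched as a single multimodal request. The example below builds a payload in the OpenAI-compatible chat format that providers such as Together AI expose; the model ID and image URL are placeholder assumptions, not confirmed identifiers.

```python
# Sketch of a multimodal chat request in the OpenAI-compatible format.
# The model ID and URL below are hypothetical placeholders, not
# confirmed values for this model.
import json

payload = {
    "model": "nvidia/nemotron-3-nano-omni",  # hypothetical model ID
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "The user hit this error during checkout. "
                         "Which policy applies, and what should the agent do?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/error-screenshot.png"}},
            ],
        }
    ],
    "max_tokens": 512,
}

body = json.dumps(payload)
print(len(json.loads(body)["messages"][0]["content"]))  # 2 content parts
```

The text instruction and the screenshot travel in one message, so the model can ground its answer in both at once instead of handing off between separate perception and reasoning services.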
- Model provider: NVIDIA
- Type: Audio, Image, Reasoning, Video, Vision, Transcribe
- Main use cases: Chat, Reasoning, Coding Agents
- Deployment: On-Demand Dedicated, Monthly Reserved
- Parameters: 30B
- Activated parameters: 3B
- Context length: 256K
- Input modalities: Text
- Output modalities: Text
- Category: Chat