
NVIDIA Nemotron 3 Nano Omni

An open omni-modal foundation model that unifies understanding and reasoning across video, audio, images, documents, charts, and text — purpose-built for agentic AI.

About model

NVIDIA Nemotron™ 3 Nano Omni replaces fragmented, modality-specific pipelines with a single coherent system, enabling enterprises and developers to build agents that reason over real-world inputs and act with precision at production scale. Consolidating perception and reasoning into one model reduces failure modes and simplifies updates, customization, and feature evolution — enabling faster iteration cycles and more robust production systems.

Context Length

256K

Long context for better multi-step reasoning

Video & image throughput

9x

Higher throughput, lower compute for video reasoning

Active Parameters

3B

From 30B total MoE architecture

Model key capabilities
  • Unified Multimodal Understanding: Natively processes video, audio, images, documents, charts, and GUIs within up to 256K tokens of shared multimodal context
  • Multi-Environment Training: Multi-environment RL training with NeMo RL improves instruction following and converges faster to correct answers, yielding 19% higher multimodal intelligence
  • Fully Open-Source: Open weights (NVIDIA license), open data (synthetic data generated from frontier models), and open recipes for full transparency and control
  • Production-Ready Infrastructure: 99.9% SLA, available on Together AI dedicated infrastructure
  • Model card

    Architecture Overview:
    • Mixture of Experts (MoE) with hybrid Mamba-transformer architecture
    • 30B total parameters with 3B activated
    • 3D Convolution (Conv3D) layers for efficient temporal-spatial handling
    • Efficient video sampling (EVS) for processing longer videos in the same amount of time
    • Designed for agentic production workflows
    • 256K token context improves coherence by preserving long conversation history, plan state and cross-document context


    Training Methodology:
    • Multi-environment RL training with NeMo RL and NeMo Gym
    • Open source: post-training and optimization techniques
    • Open data: NVIDIA generated synthetic datasets
    • Open recipes: NVIDIA development techniques and tools for customization and optimization


    Performance Characteristics:
    • 9x higher throughput for video reasoning
    • 19% higher multimodal intelligence using multi-environment RL training
    • Improved coherence from 256K-token context length
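As a sketch of how an application might combine text and image input in a single request, assuming the model is served behind an OpenAI-compatible chat completions endpoint (the model id and image URL below are hypothetical placeholders, not confirmed by this page):

```python
# Sketch: building an OpenAI-style multimodal chat request body.
# The model id and image URL are hypothetical placeholders.

def build_multimodal_request(question: str, image_url: str) -> dict:
    """Return a chat-completions request body mixing text and image content."""
    return {
        "model": "nvidia/nemotron-3-nano-omni",  # hypothetical model id
        "max_tokens": 1024,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

body = build_multimodal_request(
    "What error is shown in this screenshot?",
    "https://example.com/error-screenshot.png",  # placeholder
)
print(body["model"])  # prints nvidia/nemotron-3-nano-omni
```

The same `content` list pattern extends to additional parts (more images, or audio/video parts where the serving stack supports them), which is how several artifacts end up in one shared context.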

  • Applications & use cases

    Customer Service Agent:
    • Processes multimodal inputs such as call recordings, screen recordings of sessions, screenshots of errors, and knowledge-base documents — all in one reasoning loop
    • Understands not just what the customer said, but what they experienced and what business rules allow

    Audio: Recorded customer calls

    Speech: ASR transcripts of conversations

    Video: Screen recordings of sessions

    Images: Screenshots of errors, invoices

    Docs: Knowledge base, policies, CRM



    Financial Analyst Agent:
    • Reasons across financial filings, charts, scanned reports, earnings call audio, and investor presentation videos.
    • Ties together what executives say, how numbers are presented visually, and what underlying documents show — producing grounded insights rather than surface-level summaries.

    Docs: Financial reports, earnings transcripts

    Images: Charts, tables, PDFs, scanned reports

    Speech: Earnings calls, analyst Q&A

    Audio: Raw call recordings for tone/emphasis

    Video: Investor presentations, briefings

    Computer Use Agent:
    • Processes screen recordings to understand UI state over time, interprets instructions and system signals, and reads policy documents — all within one reasoning loop.
    • Enables the agent to see the interface, understand intent, read constraints, and take the correct action.

    Video: Screen recordings, UI state changes

    Images: Screenshots, dashboards, forms

    Speech: Spoken user instructions

    Audio: System alerts, confirmation cues

    Docs: Task instructions, validation policies
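Each agent above pushes several artifacts into one shared 256K-token context, so a simple budget check is useful before assembling a request. A minimal sketch, where the per-item token estimates are illustrative assumptions rather than published figures:

```python
# Rough context-budget check against the 256K-token window.
# Per-item token counts below are illustrative assumptions only.

CONTEXT_LIMIT = 256_000

ESTIMATED_TOKENS = {
    "call_transcript": 8_000,
    "screen_recording_frames": 40_000,
    "error_screenshot": 2_000,
    "kb_document": 5_000,
}

def fits_in_context(items: dict, reserve_for_output: int = 4_000) -> bool:
    """True if the combined inputs plus a reserved output budget fit in 256K."""
    used = sum(items.values()) + reserve_for_output
    return used <= CONTEXT_LIMIT

print(fits_in_context(ESTIMATED_TOKENS))  # prints True
```

In practice an agent would measure real token counts with the model's tokenizer and trim or summarize the largest items first when the check fails.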

Model details
  • Model provider
    NVIDIA
  • Type
    Audio
    Image
    Reasoning
    Video
    Vision
    Transcribe
  • Main use cases
    Chat
    Reasoning
    Coding Agents
  • Deployment
    On-Demand Dedicated
    Monthly Reserved
  • Parameters
    30B
  • Activated parameters
    3B
  • Context length
    256K
  • Input modalities
    Text
    Image
    Audio
    Video
  • Output modalities
    Text
  • Category
    Chat