
NVIDIA Nemotron 3 Nano Omni

An open omni-modal foundation model that unifies understanding and reasoning across video, audio, images, documents, charts, and text — purpose-built for agentic AI.

About model

NVIDIA Nemotron™ 3 Nano Omni replaces fragmented, modality-specific pipelines with a single coherent system, enabling enterprises and developers to build agents that reason over real-world inputs and act with precision at production scale. Consolidating perception and reasoning into one model reduces failure modes and simplifies updates, customization, and feature evolution — enabling faster iteration cycles and more robust production systems.

Context Length

256K

Long context for better multi-step reasoning

Video & image throughput

9x

Higher throughput, lower compute for video reasoning

Active Parameters

3B

From 30B total MoE architecture

Model key capabilities
  • Unified Multimodal Understanding: Natively processes video, audio, images, documents, charts, and GUIs within up to 256K tokens of shared multimodal context
  • Multi-Environment Training: Multi-environment RL training with NeMo RL improves instruction following and converges faster to correct answers, yielding 19% higher multimodal intelligence
  • Fully Open-Source: Open weights (NVIDIA license), open data (synthetic data generated from frontier models), and open recipes for full transparency and control
  • Production-Ready Infrastructure: 99.9% SLA, available on Together AI dedicated infrastructure
  • Model card

    Architecture Overview:
    • Mixture of Experts (MoE) with hybrid Mamba-transformer architecture
    • 30B total parameters with 3B activated
    • 3D Convolution (Conv3D) layers for efficient temporal-spatial handling
    • Efficient video sampling (EVS) for processing longer videos in the same amount of time
    • Designed for agentic production workflows
    • 256K token context improves coherence by preserving long conversation history, plan state and cross-document context


    Training Methodology:
    • Multi-environment RL training with NeMo RL and NeMo Gym
    • Open source: post-training and optimization techniques
    • Open data: NVIDIA generated synthetic datasets
    • Open recipes: NVIDIA development techniques and tools for customization and optimization


    Performance Characteristics:
    • 9x higher throughput for video reasoning
    • 19% higher multimodal intelligence using multi-environment RL training
    • Improved coherence from 256K-token context length
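As a sketch of how an application might combine text and image input in a single request, assuming the model is served behind an OpenAI-compatible chat completions endpoint (the model id and image URL below are hypothetical placeholders, not confirmed by this page):

```python
# Sketch: building an OpenAI-style multimodal chat request body.
# The model id and image URL are hypothetical placeholders.

def build_multimodal_request(question: str, image_url: str) -> dict:
    """Return a chat-completions request body mixing text and image content."""
    return {
        "model": "nvidia/nemotron-3-nano-omni",  # hypothetical model id
        "max_tokens": 1024,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

body = build_multimodal_request(
    "What error is shown in this screenshot?",
    "https://example.com/error-screenshot.png",  # placeholder
)
print(body["model"])  # prints nvidia/nemotron-3-nano-omni
```

The same `content` list pattern extends to additional parts (more images, or audio/video parts where the serving stack supports them), which is how several artifacts end up in one shared context.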

  • Applications & use cases

    Customer Service Agent:
    • Processes multimodal inputs such as call recordings, screen recordings of sessions, screenshots of errors, and knowledge-base documents — all in one reasoning loop
    • Understands not just what the customer said, but what they experienced and what business rules allow

    Audio: Recorded customer calls

    Speech: ASR transcripts of conversations

    Video: Screen recordings of sessions

    Images: Screenshots of errors, invoices

    Docs: Knowledge base, policies, CRM



    Financial Analyst Agent:
    • Reasons across financial filings, charts, scanned reports, earnings call audio, and investor presentation videos.
    • Ties together what executives say, how numbers are presented visually, and what underlying documents show — producing grounded insights rather than surface-level summaries.

    Docs: Financial reports, earnings transcripts

    Images: Charts, tables, PDFs, scanned reports

    Speech: Earnings calls, analyst Q&A

    Audio: Raw call recordings for tone/emphasis

    Video: Investor presentations, briefings

    Computer Use Agent:
    • Processes screen recordings to understand UI state over time, interprets instructions and system signals, and reads policy documents — all within one reasoning loop.
    • Enables the agent to see the interface, understand intent, read constraints, and take the correct action.

    Video: Screen recordings, UI state changes

    Images: Screenshots, dashboards, forms

    Speech: Spoken user instructions

    Audio: System alerts, confirmation cues

    Docs: Task instructions, validation policies
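Each agent above pushes several artifacts into one shared 256K-token context, so a simple budget check is useful before assembling a request. A minimal sketch, where the per-item token estimates are illustrative assumptions rather than published figures:

```python
# Rough context-budget check against the 256K-token window.
# Per-item token counts below are illustrative assumptions only.

CONTEXT_LIMIT = 256_000

ESTIMATED_TOKENS = {
    "call_transcript": 8_000,
    "screen_recording_frames": 40_000,
    "error_screenshot": 2_000,
    "kb_document": 5_000,
}

def fits_in_context(items: dict, reserve_for_output: int = 4_000) -> bool:
    """True if the combined inputs plus a reserved output budget fit in 256K."""
    used = sum(items.values()) + reserve_for_output
    return used <= CONTEXT_LIMIT

print(fits_in_context(ESTIMATED_TOKENS))  # prints True
```

In practice an agent would measure real token counts with the model's tokenizer and trim or summarize the largest items first when the check fails.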

Model details
  • Model provider
    NVIDIA
  • Type
    Audio
    Image
    Reasoning
    Video
    Vision
    Transcribe
  • Main use cases
    Chat
    Reasoning
    Coding Agents
  • Deployment
    On-Demand Dedicated
    Monthly Reserved
  • Parameters
    30B
  • Activated parameters
    3B
  • Context length
    256K
  • Input modalities
    Text
    Image
    Audio
    Video
  • Output modalities
    Text
  • Category
    Chat