NVIDIA Nemotron 3.5 ASR

Cache-aware streaming speech recognition across 40 language locales at sub-100ms latency

About model

NVIDIA Nemotron 3.5 ASR is a 0.6B-parameter, cache-aware streaming speech recognition model that covers 40 language-locale combinations (approximately 36 languages) in a single checkpoint. It delivers sub-100ms time-to-final transcription, lets you select the chunk size at runtime (80ms, 160ms, 560ms, or 1.12s), and supports 3x higher GPU concurrency than buffered approaches on H100. The model produces native punctuation and capitalization across all 40 locales and distinguishes regional variants such as es-ES vs es-US. It is available under the NVIDIA Open Model License.

  • Language Locales: 40. Single checkpoint covering ~36 languages with locale-aware variants
  • Time-to-Final: Sub-100ms. Runtime-configurable latency from 80ms to 1.12s without retraining
  • GPU Concurrency: 3x. Versus buffered approaches on H100

Model key capabilities
  • 40-Locale Streaming: Single checkpoint covering ~36 languages across 40 locale combinations — including 23 of 24 EU official languages, major Asian markets, and right-to-left scripts — with no per-language deployment
  • Cache-Aware Architecture: Eliminates overlap recompute for stable latency at high concurrency, delivering 3x concurrent streams versus buffered baselines
  • Runtime-Configurable Latency: Switch between 80ms, 160ms, 560ms, and 1.12s chunk sizes at runtime without retraining, enabling precise latency-accuracy tradeoffs per deployment
  • Production-Ready Output: Native punctuation and capitalization across all 40 locales, with locale-aware dialect variants (es-ES vs es-US, pt-BR vs pt-PT, fr-FR vs fr-CA)
  • API usage

    Endpoint:

    nvidia/nemotron-3.5-asr-streaming-0.6b

    cURL:

    # audio.wav is a placeholder path to a local recording
    curl -X POST "https://api.together.xyz/v1/audio/transcriptions" \
      -H "Authorization: Bearer $TOGETHER_API_KEY" \
      -F "file=@audio.wav" \
      -F "model=nvidia/nemotron-3.5-asr-streaming-0.6b" \
      -F "language=en" \
      -F "response_format=json" \
      -F "timestamp_granularities=segment"
    
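    Python:

    from together import Together

    client = Together()  # reads TOGETHER_API_KEY from the environment

    # audio.wav is a placeholder path to a local recording
    with open("audio.wav", "rb") as audio_file:
        response = client.audio.transcriptions.create(
            file=audio_file,
            model="nvidia/nemotron-3.5-asr-streaming-0.6b",
            language="en",
            response_format="json",
            timestamp_granularities="segment",
        )
    print(response.text)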
    
    TypeScript:

    import fs from "fs";
    import Together from "together-ai";

    const together = new Together();

    // audio.wav is a placeholder path to a local recording
    const response = await together.audio.transcriptions.create({
      file: fs.createReadStream("audio.wav"),
      model: "nvidia/nemotron-3.5-asr-streaming-0.6b",
      language: "en",
      response_format: "json",
      timestamp_granularities: "segment",
    });
    console.log(response);
    
  • Model card

    Architecture Overview:
    • 600M (0.6B) parameter cache-aware streaming ASR model
    • Cache-aware FastConformer streaming: no overlap recompute, stable latency at scale
    • Runtime-configurable chunk sizes: 80ms, 160ms, 560ms, 1.12s — no retraining required
    • Single checkpoint for all 40 language-locale combinations
    • Script support: Latin, Cyrillic, Arabic, Hebrew, CJK, Devanagari, Thai
    • Audio in, text out
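
To make the chunk sizes above concrete, the sketch below converts each configurable chunk duration into a sample count and a chunks-per-minute figure. The 16 kHz sample rate is an assumption (it is common for ASR models but is not stated on this page), and the function names are illustrative only.

```python
# Chunk durations listed on this page, in milliseconds.
CHUNK_SIZES_MS = (80, 160, 560, 1120)

# ASSUMPTION: 16 kHz mono input; the page does not state the sample rate.
SAMPLE_RATE_HZ = 16_000


def samples_per_chunk(chunk_ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples that make up one chunk."""
    return chunk_ms * sample_rate_hz // 1000


def chunks_per_minute(chunk_ms: int) -> float:
    """How many chunks are processed per minute of audio."""
    return 60_000 / chunk_ms


if __name__ == "__main__":
    for ms in CHUNK_SIZES_MS:
        print(f"{ms:>5} ms -> {samples_per_chunk(ms):>6} samples, "
              f"{chunks_per_minute(ms):.1f} chunks/min")
```

Under that assumption, an 80ms chunk holds 1,280 samples and a 1.12s chunk holds 17,920, so the smallest setting runs 14 times as many steps per minute of audio as the largest.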

    Language Coverage:
    • 40 language-locale combinations across ~36 languages
    • 23 of 24 EU official languages
    • Major Asian markets: Chinese (Mandarin), Japanese, Korean, Vietnamese, Thai, Hindi
    • Right-to-left scripts: Arabic, Hebrew
    • Locale-aware variants: es-ES vs es-US, pt-BR vs pt-PT, fr-FR vs fr-CA, nb-NO vs nn-NO
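
A client that routes audio by region may want to normalize locale tags before choosing a variant. The sketch below is a hypothetical helper, not part of any SDK: it covers only the variants named on this page, and whether the API's language parameter accepts these tags directly is not confirmed here.

```python
# Locale variants named on this page; an illustrative subset, not all 40.
KNOWN_VARIANTS = {
    "es-ES", "es-US", "pt-BR", "pt-PT", "fr-FR", "fr-CA", "nb-NO", "nn-NO",
}


def normalize_locale(tag: str) -> str:
    """Normalize 'PT_br' or 'pt-br' to the 'pt-BR' form used on this page."""
    parts = tag.replace("_", "-").split("-")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected a language-region tag, got {tag!r}")
    language, region = parts
    return f"{language.lower()}-{region.upper()}"


def pick_locale(requested: str, default: str) -> str:
    """Return the requested locale if it is a known variant, else the default."""
    candidate = normalize_locale(requested)
    return candidate if candidate in KNOWN_VARIANTS else default


if __name__ == "__main__":
    print(pick_locale("fr_ca", default="fr-FR"))
    print(pick_locale("de-DE", default="fr-FR"))
```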

    Performance Characteristics:
    • Sub-100ms time-to-final transcription
    • 3x higher GPU concurrency versus buffered baselines
    • Native punctuation and capitalization across all 40 locales without post-processing
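
The 3x concurrency figure translates directly into capacity planning. The sketch below estimates how many GPUs a target number of concurrent streams would need; the baseline of 100 streams per GPU in the example is a made-up number for illustration, not a benchmark from this page.

```python
import math

# From this page: roughly 3x the concurrent streams of a buffered baseline.
CONCURRENCY_GAIN = 3


def gpus_needed(target_streams: int, baseline_streams_per_gpu: int,
                gain: float = CONCURRENCY_GAIN) -> int:
    """GPUs required to serve target_streams concurrent streams."""
    if target_streams < 0 or baseline_streams_per_gpu <= 0:
        raise ValueError("stream counts must be positive")
    streams_per_gpu = baseline_streams_per_gpu * gain
    return math.ceil(target_streams / streams_per_gpu)


if __name__ == "__main__":
    # HYPOTHETICAL baseline: 100 buffered streams per GPU.
    print("buffered:   ", gpus_needed(1000, 100, gain=1))
    print("cache-aware:", gpus_needed(1000, 100))
```

With that hypothetical baseline, 1,000 concurrent streams would take 10 GPUs buffered and 4 GPUs at 3x concurrency.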

  • Prompting

    Together AI API Access:
    • Access NVIDIA Nemotron 3.5 ASR using the endpoint nvidia/nemotron-3.5-asr-streaming-0.6b
    • Authenticate using your Together AI API key in request headers
    • Configure chunk size at runtime (80ms, 160ms, 560ms, 1.12s) to tune latency per deployment
    • Available on Together AI serverless and on-demand dedicated infrastructure

  • Applications & use cases

    Multilingual Voice Agents:
    • Real-time transcription for customer support, in-car assistants, and retail kiosks across 40 locales
    • Sub-100ms latency for conversational voice agent pipelines
    • Single model serving global markets without per-language infrastructure

    Healthcare & Clinical Documentation:
    • Clinical dictation and documentation in multiple languages
    • Airgapped and sovereign deployments where data cannot leave a region

    Live Captioning & Transcription:
    • Real-time captioning for events, broadcast, and video conferencing across languages
    • Meeting and call transcription for sales intelligence and contact-center QA
    • Accessibility tooling with live captions in users' native languages

    Post-Call & Offline Analytics:
    • Batch transcription of multilingual call recordings at production scale
    • Speech analytics across 40 locales in airgapped or regulated environments
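
For batch transcription of recordings, one possible pattern is a small thread pool around a per-file transcription call. The sketch below is an assumption about how such a job could be organized, not a documented workflow: `transcribe_batch` and its `transcribe` argument are hypothetical names, and in practice `transcribe` would wrap the request shown under API usage.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable

# Any callable that takes an audio path and returns the transcript text.
TranscribeFn = Callable[[Path], str]


def transcribe_batch(paths: Iterable[Path], transcribe: TranscribeFn,
                     max_workers: int = 4) -> Dict[str, str]:
    """Transcribe recordings concurrently and map each file name to its text.

    A failure on one file is recorded as an error string so that the rest
    of the batch still completes.
    """
    def safe(path: Path) -> str:
        try:
            return transcribe(path)
        except Exception as exc:  # keep the batch going on per-file errors
            return f"ERROR: {exc}"

    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(safe, path_list))  # map preserves input order
    return {p.name: t for p, t in zip(path_list, texts)}


if __name__ == "__main__":
    # Stub standing in for a real API call.
    def fake_transcribe(path: Path) -> str:
        return f"transcript of {path.stem}"

    print(transcribe_batch([Path("call1.wav"), Path("call2.wav")], fake_transcribe))
```

Passing the transcription function in as an argument keeps the helper testable without network access and lets the same loop serve different locales or models.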

Related models
  • Model provider
    NVIDIA
  • Type
    Audio
  • Deployment
    On-Demand Dedicated
    Serverless
  • Parameters
    0.6B
  • Price

    $0.0045 / min

  • Input modalities
    Audio
  • Output modalities
    Text
  • Released
    June 4, 2026
  • Category
    Transcribe
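
As a rough budgeting aid, the listed per-minute price can be turned into a cost estimate. The sketch below assumes billing is strictly proportional to audio duration, with no minimum charge or rounding; that billing model is an assumption, not something this page states.

```python
from decimal import Decimal

# Listed price on this page, in US dollars per minute of audio.
PRICE_PER_MINUTE = Decimal("0.0045")


def transcription_cost(minutes: float) -> Decimal:
    """Estimated cost in USD for the given minutes of audio.

    ASSUMPTION: cost scales linearly with duration, with no minimum
    charge and no rounding of partial minutes.
    """
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return Decimal(str(minutes)) * PRICE_PER_MINUTE


if __name__ == "__main__":
    print("1 hour:      $", transcription_cost(60))
    print("10,000 hours: $", transcription_cost(10_000 * 60))
```

Under that assumption, one hour of audio costs $0.27 and 10,000 hours costs $2,700.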