NVIDIA Nemotron 3.5 ASR
Cache-aware streaming speech recognition across 40 language locales at sub-100ms latency
About model
NVIDIA Nemotron 3.5 ASR is a 0.6B-parameter, cache-aware streaming speech recognition model that covers 40 language-locale combinations across approximately 36 languages in a single checkpoint. It delivers sub-100ms time-to-final transcription, lets you choose the chunk size at runtime (80ms, 160ms, 560ms, or 1.12s) to trade latency against accuracy, and supports 3x higher GPU concurrency than buffered approaches. The model emits punctuation and capitalization natively across all 40 locales and distinguishes regional variants such as es-ES and es-US. It is available under the NVIDIA Open Model License.
40 locales: Single checkpoint covering ~36 languages with locale-aware variants
Sub-100ms time-to-final: Runtime-configurable latency from 80ms to 1.12s without retraining
3x GPU concurrency: Versus buffered approaches on H100
- 40-Locale Streaming: Single checkpoint covering ~36 languages across 40 locale combinations — including 23 of 24 EU official languages, major Asian markets, and right-to-left scripts — with no per-language deployment
- Cache-Aware Architecture: Eliminates overlap recompute for stable latency at high concurrency, delivering 3x concurrent streams versus buffered baselines
- Runtime-Configurable Latency: Switch between 80ms, 160ms, 560ms, and 1.12s chunk sizes at runtime without retraining, enabling precise latency-accuracy tradeoffs per deployment
- Production-Ready Output: Native punctuation and capitalization across all 40 locales, with locale-aware dialect variants (es-ES vs es-US, pt-BR vs pt-PT, fr-FR vs fr-CA)
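To make the "no overlap recompute" point concrete, here is a toy cost model, not NVIDIA's implementation. It counts how many encoder frames each strategy processes for the same audio. The function names, the frame counts, and the assumption that a buffered system re-encodes a fixed window of past frames at every step are all illustrative; the numbers are arbitrary and are not meant to reproduce the 3x figure.

```python
# Toy cost model (illustrative only): frames encoded while streaming an utterance.

def buffered_frames(total_frames: int, chunk: int, context: int) -> int:
    """Buffered streaming: every step re-encodes the new chunk plus up to
    `context` frames of audio that was already seen (the overlap)."""
    processed = 0
    for start in range(0, total_frames, chunk):
        new = min(chunk, total_frames - start)
        overlap = min(context, start)  # past frames recomputed at this step
        processed += new + overlap
    return processed


def cache_aware_frames(total_frames: int, chunk: int) -> int:
    """Cache-aware streaming: past activations are read from a cache,
    so each frame is encoded exactly once."""
    processed = 0
    for start in range(0, total_frames, chunk):
        processed += min(chunk, total_frames - start)
    return processed


# Hypothetical sizes: 1000 frames of audio, 8-frame chunks, 8 frames of overlap.
buffered = buffered_frames(1000, chunk=8, context=8)
cached = cache_aware_frames(1000, chunk=8)
print(f"buffered={buffered} cache_aware={cached}")
```

In this sketch the cache-aware count always equals the audio length, while the buffered count grows with the amount of overlap, which is the per-stream compute that a cache-aware design avoids.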
API usage
Endpoint: nvidia/nemotron-3.5-asr-streaming-0.6b
Model card
Architecture Overview:
• 600M (0.6B) parameter cache-aware streaming ASR model
• Cache-aware FastConformer streaming: no overlap recompute, stable latency at scale
• Runtime-configurable chunk sizes: 80ms, 160ms, 560ms, 1.12s — no retraining required
• Single checkpoint for all 40 language-locale combinations
• Script support: Latin, Cyrillic, Arabic, Hebrew, CJK, Devanagari, Thai
• Audio in, text out
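As a rough guide to what each chunk size means for a client, the sketch below converts the four listed durations into samples and bytes per chunk. The 16 kHz mono, 16-bit PCM input format is an assumption made for the arithmetic (the page does not state the expected audio format), and the helper names are hypothetical.

```python
# Chunk durations listed for the model, in milliseconds.
CHUNK_MS = (80, 160, 560, 1120)

SAMPLE_RATE_HZ = 16_000  # ASSUMPTION: 16 kHz mono input
BYTES_PER_SAMPLE = 2     # ASSUMPTION: 16-bit PCM


def chunk_samples(chunk_ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples in one chunk of the given duration."""
    if chunk_ms not in CHUNK_MS:
        raise ValueError(f"chunk_ms must be one of {CHUNK_MS}, got {chunk_ms}")
    return sample_rate_hz * chunk_ms // 1000


def chunk_bytes(chunk_ms: int) -> int:
    """Size in bytes of one chunk under the assumed PCM format."""
    return chunk_samples(chunk_ms) * BYTES_PER_SAMPLE


for ms in CHUNK_MS:
    print(f"{ms:>4} ms -> {chunk_samples(ms):>5} samples, {chunk_bytes(ms):>5} bytes")
```

Under these assumptions, an 80ms chunk is 1280 samples (2560 bytes) and a 1.12s chunk is 17920 samples (35840 bytes); smaller chunks mean more frequent, smaller sends.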
Language Coverage:
• 40 language-locale combinations across ~36 languages
• 23 of 24 EU official languages
• Major Asian markets: Chinese (Mandarin), Japanese, Korean, Vietnamese, Thai, Hindi
• Right-to-left scripts: Arabic, Hebrew
• Locale-aware variants: es-ES vs es-US, pt-BR vs pt-PT, fr-FR vs fr-CA, nb-NO vs nn-NO
Performance Characteristics:
• Sub-100ms time-to-final transcription
• 3x higher GPU concurrency versus buffered baselines
• Native punctuation and capitalization across all 40 locales without post-processing
Prompting
Together AI API Access:
• Access NVIDIA Nemotron 3.5 ASR using the endpoint nvidia/nemotron-3.5-asr-streaming-0.6b
• Authenticate using your Together AI API key in request headers
• Configure chunk size at runtime (80ms, 160ms, 560ms, 1.12s) to tune latency per deployment
• Available on Together AI on-demand dedicated infrastructure
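The points above can be sketched as a request builder. This only shows where the model identifier, the API key, and the chunk size would go; it does not send anything and omits the audio payload. The URL path, the `chunk_ms` field name, and the JSON body shape are assumptions for illustration (a streaming model may well use a different route or a persistent connection), so check the Together AI API reference before relying on any of them.

```python
import json
import urllib.request

MODEL = "nvidia/nemotron-3.5-asr-streaming-0.6b"  # identifier given on this page

# ASSUMPTION: route shown for illustration; the real streaming route may differ.
API_URL = "https://api.together.xyz/v1/audio/transcriptions"

ALLOWED_CHUNK_MS = (80, 160, 560, 1120)  # chunk sizes listed for the model


def build_request(api_key: str, chunk_ms: int = 160) -> urllib.request.Request:
    """Build (but do not send) an authenticated request for the model."""
    if chunk_ms not in ALLOWED_CHUNK_MS:
        raise ValueError(f"chunk_ms must be one of {ALLOWED_CHUNK_MS}")
    # ASSUMPTION: "chunk_ms" is a hypothetical field name for the chunk size.
    body = json.dumps({"model": MODEL, "chunk_ms": chunk_ms}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # API key in request headers
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

A caller would typically read the key from an environment variable, for example `build_request(os.environ["TOGETHER_API_KEY"], chunk_ms=80)`, rather than hard-coding it.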
Applications & use cases
Multilingual Voice Agents:
• Real-time transcription for customer support, in-car assistants, and retail kiosks across 40 locales
• Sub-100ms latency for conversational voice agent pipelines
• Single model serving global markets without per-language infrastructure
Healthcare & Clinical Documentation:
• Clinical dictation and documentation across multiple languages
• Airgapped and sovereign deployments where data cannot leave a region
Live Captioning & Transcription:
• Real-time captioning for events, broadcast, and video conferencing across languages
• Meeting and call transcription for sales intelligence and contact-center QA
• Accessibility tooling with live captions in users' native languages
Post-Call & Offline Analytics:
• Batch transcription of multilingual call recordings at production scale
• Speech analytics across 40 locales in airgapped or regulated environments
- Model provider: NVIDIA
- Type: Audio
- Deployment: On-Demand Dedicated, Serverless
- Parameters: 0.6B
- Price: $0.0045/min
- Input modalities: Audio
- Output modalities: Text
- Released: June 4, 2026
- Category: Transcribe
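For budgeting, the listed price of $0.0045 per minute of audio can be turned into a quick estimate. The sketch below assumes billing is proportional to audio duration with no minimum charge or rounding to whole minutes, which the listing does not confirm; the function name is hypothetical.

```python
from decimal import Decimal

PRICE_PER_MINUTE = Decimal("0.0045")  # USD per minute of audio, from the listing


def transcription_cost(audio_seconds: int) -> Decimal:
    """Estimated cost in USD, ASSUMING purely proportional per-minute billing."""
    minutes = Decimal(audio_seconds) / Decimal(60)
    return (minutes * PRICE_PER_MINUTE).quantize(Decimal("0.0001"))


one_hour = transcription_cost(3600)              # 60 min * 0.0045
thousand_hours = transcription_cost(3_600_000)   # 60,000 min * 0.0045
print(f"1 hour: ${one_hour}, 1000 hours: ${thousand_hours}")
```

Under that assumption, one hour of audio comes to $0.27 and 1,000 hours to $270.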