NVIDIA Parakeet TDT 0.6B v3
High-throughput multilingual speech-to-text across EU official languages
About model
Parakeet TDT 0.6B v3 is NVIDIA's 600-million-parameter multilingual speech-to-text model, designed for high-throughput transcription across EU official languages. It is built on the FastConformer-TDT architecture and trained on 670,000+ hours of audio (NVIDIA's Granary dataset plus human-transcribed fine-tuning data). The model detects the input language automatically and transcribes without additional prompting. It ranks among the highest-throughput multilingual models on the Hugging Face Open ASR Leaderboard, with a 6.34% average word error rate (WER) on the leaderboard's English benchmarks.
670K+
Granary dataset with human-transcribed fine-tuning
All
Automatic language detection without prompting
6.34%
Among highest throughput multilingual models
Multilingual Transcription: Automatic language detection and transcription across EU official languages without additional prompting
High Throughput: FastConformer-TDT architecture optimized for real-time and large-volume transcription; among the highest-throughput multilingual models on the Hugging Face Open ASR Leaderboard
Production-Ready Output: Automatic punctuation, capitalization, and word-level timestamps with every transcription
Noise Robustness: Maintains transcription accuracy across challenging acoustic environments with background noise, speaker distance variation, and overlapping speech
API usage
Endpoint: nvidia/parakeet-tdt-0.6b-v3
$0.0015/min
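As a rough illustration of the per-minute rate, the sketch below estimates cost from audio duration at the listed $0.0015/min. It assumes billing is linear in duration with no rounding or minimum charge, which this page does not state; estimate_cost is a hypothetical helper, not part of any SDK.

```python
from decimal import Decimal

PRICE_PER_MINUTE = Decimal("0.0015")  # USD per audio minute, from the listed rate


def estimate_cost(duration_seconds: float) -> Decimal:
    """Estimate cost in USD, assuming linear billing with no rounding (an assumption)."""
    minutes = Decimal(str(duration_seconds)) / Decimal(60)
    return minutes * PRICE_PER_MINUTE


# 1,000 hours of audio is 60,000 minutes.
print(estimate_cost(1000 * 3600))
```

Under that assumption, 1,000 hours of audio comes to about $90.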
Model card
Architecture Overview:
• FastConformer-TDT (Token-and-Duration Transducer) architecture with 600 million parameters
• Designed for high-throughput inference; among the highest-throughput multilingual models on the Hugging Face Open ASR Leaderboard
• Automatic language detection across EU official languages without prompting
• Input: 16 kHz mono audio (.wav, .flac)
• Output: Text with automatic punctuation, capitalization, and word-level/segment-level timestamps
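Because the stated input format is 16 kHz mono, it can help to check files before sending them. The following is a minimal sketch that uses Python's standard wave module to report mismatches in a .wav file; check_wav is a hypothetical helper, and it only inspects the file header rather than resampling.

```python
import wave

EXPECTED_RATE = 16_000   # Hz, per the stated input format
EXPECTED_CHANNELS = 1    # mono


def check_wav(path: str) -> list[str]:
    """Return a list of mismatches against 16 kHz mono; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before upload.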
Training Methodology:
• Trained on NVIDIA's Granary dataset: approximately 660,000 hours of pseudo-labeled multilingual audio across EU official languages
• Fine-tuned on 10,000 hours of human-transcribed data from NeMo ASR Set 3.0 (including LibriSpeech, Fisher Corpus, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice)
• Initialized from a multilingual CTC checkpoint, then trained for 150,000 steps on 128 A100 GPUs
• Unified SentencePiece tokenizer with 8,192 tokens across all supported languages
Performance Characteristics:
• 6.34% average WER on the Hugging Face Open ASR Leaderboard (English benchmarks)
• 11.97% average WER on FLEURS multilingual benchmark
• 7.83% average WER on MLS benchmark
• Maintains accuracy under noise: 7.12% WER at an SNR of 10 dB, 8.23% at an SNR of 5 dB
• 1.93% WER on LibriSpeech test-clean, 3.59% on test-other
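For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below computes it with a standard edit-distance recurrence. It is illustrative only: benchmark scoring usually normalizes text (case, punctuation) first, and this sketch does not.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 6.34% therefore corresponds to roughly six word errors per hundred reference words.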
Prompting
Together AI API Access:
• Access Parakeet TDT 0.6B v3 via Together AI APIs using the endpoint nvidia/parakeet-tdt-0.6b-v3
• Authenticate using your Together AI API key in request headers
• Send 16 kHz mono audio (.wav, .flac) as input and receive transcribed text with punctuation and timestamps
• The model automatically detects the language of the input audio, so no language specification is required
• Available on both serverless and dedicated infrastructure for production workloads
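The access steps above can be sketched with Python's standard library. This is a hedged example, not an official client: the URL below assumes an OpenAI-style transcription route, the form field names "model" and "file" and the TOGETHER_API_KEY environment variable are likewise assumptions, and the response shape is not specified on this page. Check the Together AI API reference before relying on any of them.

```python
import json
import os
import urllib.request
import uuid

# Assumed URL for an OpenAI-style transcription route; confirm in the API reference.
API_URL = "https://api.together.xyz/v1/audio/transcriptions"
MODEL = "nvidia/parakeet-tdt-0.6b-v3"  # endpoint name given on this page


def build_request(audio_bytes: bytes, filename: str, api_key: str) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying the model name and the audio file."""
    boundary = uuid.uuid4().hex
    parts = [
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n{MODEL}\r\n'.encode(),
        (f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
         f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n').encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ]
    headers = {
        "Authorization": f"Bearer {api_key}",  # API key sent in the request headers
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(API_URL, data=b"".join(parts), headers=headers, method="POST")


def transcribe(path: str) -> dict:
    """Send one audio file and return the parsed JSON response."""
    with open(path, "rb") as f:
        request = build_request(f.read(), os.path.basename(path), os.environ["TOGETHER_API_KEY"])
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

No language parameter is sent, in line with the automatic language detection described above.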
Applications & use cases
Voice Agent Backends:
• Speech-to-text for conversational AI and voice agent pipelines
• Co-located with LLM and TTS on Together AI for unified voice infrastructure
• Automatic language detection for multilingual voice agents serving EU markets
Enterprise Transcription:
• Contact center analytics with multilingual transcription
• Medical, legal, and financial transcription across EU official languages
• Earnings calls and compliance monitoring for multinational organizations
Live Captioning & Accessibility:
• Real-time multilingual captioning for meetings, webinars, and broadcasts
• Subtitle generation across EU official languages with automatic punctuation
• Accessibility compliance for multimedia content
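As one way to turn word-level timestamps into subtitles, the sketch below groups words into fixed-size cues and renders them in SRT format. The input shape (a list of dicts with "word", "start", and "end" in seconds) is a hypothetical assumption for illustration, not the documented response format, and the fixed word count per cue is a simplification.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: list[dict], max_words: int = 8) -> str:
    """Group words into cues of at most max_words each and render them as SRT text."""
    cues = []
    for index, offset in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[offset:offset + max_words]
        start = format_timestamp(chunk[0]["start"])
        end = format_timestamp(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hypothetical word-level output for a short utterance.
sample = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
    {"word": "again", "start": 1.0, "end": 1.5},
]
print(words_to_srt(sample, max_words=2))
```

A production captioner would more likely break cues at punctuation or pauses than at a fixed word count.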
High-Volume Processing:
• Batch transcription of large audio archives across multiple languages
• Media and content workflows requiring high-throughput multilingual processing
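For archive-scale jobs, a simple pattern is to fan files out over a small worker pool and collect failures separately so that one bad file does not stop the batch. The sketch below assumes a caller-supplied transcribe function that takes a file path and returns text; that function, the helper name transcribe_batch, and the default worker count are all hypothetical, and real rate limits would need to be respected.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def transcribe_batch(
    paths: list[str],
    transcribe: Callable[[str], str],
    max_workers: int = 4,
) -> tuple[dict[str, str], dict[str, Exception]]:
    """Transcribe files concurrently; return (results by path, errors by path)."""
    results: dict[str, str] = {}
    errors: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its path so failures can be attributed.
        futures = {pool.submit(transcribe, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if a single file fails
                errors[path] = exc
    return results, errors
```

Threads suit this workload because each call spends most of its time waiting on the network rather than using the CPU.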
- Model provider: NVIDIA
- Type: Audio
- Deployment: Serverless, On-Demand Dedicated
- Input modalities: Audio
- Output modalities: Text
- Category: Audio