Models / NVIDIA
Audio

NVIDIA Parakeet TDT 0.6B v3

High-throughput multilingual speech-to-text across EU official languages

About model

Parakeet TDT 0.6B v3 is NVIDIA's 600-million-parameter multilingual speech-to-text model designed for high-throughput transcription across EU official languages. Built on the FastConformer-TDT architecture and trained on NVIDIA's Granary dataset (670,000+ hours of audio), the model automatically detects the input language and transcribes without additional prompting. It ranks among the highest-throughput multilingual models on the HuggingFace Open ASR Leaderboard, with a 6.34% average word error rate.

Hours of Training Audio

670K+

Granary dataset with human-transcribed fine-tuning

EU Official Languages

All

Automatic language detection without prompting

Average Word Error Rate

6.34%

Among the highest-throughput multilingual models on the HuggingFace Open ASR Leaderboard

Model key capabilities

Multilingual Transcription: Automatic language detection and transcription across EU official languages without additional prompting

High Throughput: FastConformer-TDT architecture optimized for real-time and large-volume transcription, among the highest throughput multilingual models on the HuggingFace Open ASR Leaderboard

Production-Ready Output: Automatic punctuation, capitalization, and word-level timestamps with every transcription

Noise Robustness: Maintains transcription accuracy across challenging acoustic environments with background noise, speaker distance variation, and overlapping speech

  • API usage

    • cURL
    • Python
    • TypeScript

    Endpoint:

    nvidia/parakeet-tdt-0.6b-v3

    $0.0015/min


    // Sketch: Parakeet is speech-to-text, so we upload audio for
    // transcription. This posts to Together AI's OpenAI-compatible
    // transcription route; verify the exact path and response fields
    // against the current API reference.
    import { readFileSync } from 'node:fs';

    async function transcribeAudio() {
      const form = new FormData();
      form.append('model', 'nvidia/parakeet-tdt-0.6b-v3');
      form.append(
        'file',
        new Blob([readFileSync('./speech.wav')], { type: 'audio/wav' }),
        'speech.wav',
      );

      const res = await fetch('https://api.together.xyz/v1/audio/transcriptions', {
        method: 'POST',
        headers: { Authorization: `Bearer ${process.env.TOGETHER_API_KEY}` },
        body: form,
      });

      const data = await res.json();
      console.log(data.text); // transcript with punctuation and capitalization
    }

    transcribeAudio();
    
    
  • Model card

    Architecture Overview:
    • FastConformer-TDT (Token-and-Duration Transducer) architecture with 600 million parameters
    • Designed for high-throughput inference—among the highest throughput multilingual models on HuggingFace Open ASR Leaderboard
    • Automatic language detection across EU official languages without prompting
    • Input: 16kHz monochannel audio (.wav, .flac)
    • Output: Text with automatic punctuation, capitalization, and word-level/segment-level timestamps

    Training Methodology:
    • Trained on NVIDIA's Granary dataset: approximately 660,000 hours of pseudo-labeled multilingual audio across EU official languages
    • Fine-tuned on 10,000 hours of human-transcribed data from NeMo ASR Set 3.0 (including LibriSpeech, Fisher Corpus, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice)
    • Initialized from CTC multilingual checkpoint, trained for 150,000 steps on 128 A100 GPUs
    • Unified SentencePiece tokenizer with 8,192 tokens across all supported languages

    Performance Characteristics:
    • 6.34% average WER on HuggingFace Open ASR Leaderboard (English benchmarks)
    • 11.97% average WER on FLEURS multilingual benchmark
    • 7.83% average WER on MLS benchmark
    • Maintains accuracy under noise: 7.12% WER at SNR 10, 8.23% at SNR 5
    • 1.93% WER on LibriSpeech test-clean, 3.59% on test-other
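
The WER figures above are the standard word error rate: word-level edit distance (substitutions, insertions, deletions) divided by the number of words in the reference transcript. A minimal TypeScript sketch of the metric for evaluating transcripts locally (illustrative only, not the leaderboard's evaluation harness, which also normalizes text):

```typescript
// Word error rate: Levenshtein distance over words, divided by
// the number of words in the reference transcript.
function wer(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);

  // dp[i][j] = edit distance between ref[:i] and hyp[:j]
  const dp: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0,
    ),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,     // deletion
        dp[i][j - 1] + 1,     // insertion
        dp[i - 1][j - 1] + sub, // substitution or match
      );
    }
  }
  return dp[ref.length][hyp.length] / ref.length;
}

// One substitution in a five-word reference → 0.2 (20% WER).
console.log(wer('it is a nice day', 'it is a nice play'));
```

By this definition, the model's 1.93% WER on LibriSpeech test-clean means roughly one word error per 52 reference words.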

  • Prompting

    Together AI API Access:
    • Access Parakeet TDT 0.6B v3 via Together AI APIs using the endpoint nvidia/parakeet-tdt-0.6b-v3
    • Authenticate using your Together AI API key in request headers
    • Send 16kHz monochannel audio (.wav, .flac) as input and receive transcribed text with punctuation and timestamps
    • The model automatically detects the language of the input audio—no language specification required
    • Available on both serverless and dedicated infrastructure for production workloads
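
Since the endpoint expects 16kHz mono WAV or FLAC, a quick pre-flight check of a WAV header can catch mismatched audio before upload. A sketch assuming the canonical RIFF layout, where the channel count sits at byte offset 22 and the sample rate at offset 24 (files with extra chunks before "fmt " would need a full chunk walk):

```typescript
// Inspect a canonical WAV (RIFF) header: channel count at byte
// offset 22 (uint16 LE), sample rate at offset 24 (uint32 LE).
function wavFormat(buf: Buffer): { channels: number; sampleRate: number } {
  if (
    buf.toString('ascii', 0, 4) !== 'RIFF' ||
    buf.toString('ascii', 8, 12) !== 'WAVE'
  ) {
    throw new Error('not a WAV file');
  }
  return { channels: buf.readUInt16LE(22), sampleRate: buf.readUInt32LE(24) };
}

// Build a minimal 16 kHz mono header to demonstrate the check; in
// practice you would pass fs.readFileSync('input.wav') instead.
const header = Buffer.alloc(44);
header.write('RIFF', 0, 'ascii');
header.write('WAVE', 8, 'ascii');
header.write('fmt ', 12, 'ascii');
header.writeUInt16LE(1, 22);     // channels: mono
header.writeUInt32LE(16000, 24); // sample rate: 16 kHz

const { channels, sampleRate } = wavFormat(header);
console.log(
  channels === 1 && sampleRate === 16000 ? 'ok for Parakeet' : 'resample first',
);
```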

  • Applications & use cases

    Voice Agent Backends:
    • Speech-to-text for conversational AI and voice agent pipelines
    • Co-located with LLM and TTS on Together AI for unified voice infrastructure
    • Automatic language detection for multilingual voice agents serving EU markets

    Enterprise Transcription:
    • Contact center analytics with multilingual transcription
    • Medical, legal, and financial transcription across EU official languages
    • Earnings calls and compliance monitoring for multinational organizations

    Live Captioning & Accessibility:
    • Real-time multilingual captioning for meetings, webinars, and broadcasts
    • Subtitle generation across EU official languages with automatic punctuation
    • Accessibility compliance for multimedia content
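
Because every transcription carries word-level timestamps, subtitle generation reduces to grouping words into timed cues. A sketch that emits SRT; the `Word` shape here is a hypothetical stand-in for the response payload, so adapt the field names to the actual schema:

```typescript
// Hypothetical word-timestamp shape; match it to the real response.
interface Word { word: string; start: number; end: number } // seconds

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(sec: number): string {
  const ms = Math.round(sec * 1000);
  const pad = (n: number, w = 2) => String(n).padStart(w, '0');
  return (
    `${pad(Math.floor(ms / 3600000))}:` +
    `${pad(Math.floor((ms % 3600000) / 60000))}:` +
    `${pad(Math.floor((ms % 60000) / 1000))},${pad(ms % 1000, 3)}`
  );
}

// Group words into fixed-size cues (here: 7 words per subtitle).
function wordsToSrt(words: Word[], wordsPerCue = 7): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const chunk = words.slice(i, i + wordsPerCue);
    const span = `${toSrtTime(chunk[0].start)} --> ${toSrtTime(chunk[chunk.length - 1].end)}`;
    cues.push(`${cues.length + 1}\n${span}\n${chunk.map((w) => w.word).join(' ')}`);
  }
  return cues.join('\n\n') + '\n';
}

console.log(wordsToSrt([
  { word: 'Hello', start: 0.0, end: 0.4 },
  { word: 'world', start: 0.45, end: 0.9 },
]));
```

A production captioner would also break cues at punctuation and cap line length, but the timestamp-to-cue mapping is the core of it.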

    High-Volume Processing:
    • Batch transcription of large audio archives across multiple languages
    • Media and content workflows requiring high-throughput multilingual processing

Model details
  • Model provider
    NVIDIA
  • Type
    Audio
  • Deployment
    Serverless
    On-Demand Dedicated
  • Price

    $0.0015 / min

  • Input modalities
    Audio
  • Output modalities
    Text
  • Category
    Audio