Models / NVIDIA
Audio

NVIDIA Parakeet TDT 0.6B v3

High-throughput multilingual speech-to-text across EU official languages

About model

Parakeet TDT 0.6B v3 is NVIDIA's 600-million-parameter multilingual speech-to-text model designed for high-throughput transcription across EU official languages. Built on the FastConformer-TDT architecture and trained on NVIDIA's Granary dataset (670,000+ hours of audio), the model automatically detects the input language and transcribes without additional prompting. It ranks among the highest-throughput multilingual models on the HuggingFace Open ASR Leaderboard, with a 6.34% average word error rate.

Hours of Training Audio

670K+

Granary dataset with human-transcribed fine-tuning

EU Official Languages

All

Automatic language detection without prompting

Average Word Error Rate

6.34%

Among the highest-throughput multilingual models on the HuggingFace Open ASR Leaderboard

Model key capabilities

Multilingual Transcription: Automatic language detection and transcription across EU official languages without additional prompting

High Throughput: FastConformer-TDT architecture optimized for real-time and large-volume transcription, among the highest throughput multilingual models on the HuggingFace Open ASR Leaderboard

Production-Ready Output: Automatic punctuation, capitalization, and word-level timestamps with every transcription

Noise Robustness: Maintains transcription accuracy across challenging acoustic environments with background noise, speaker distance variation, and overlapping speech

  • API usage

    • cURL
    • Python
    • TypeScript

    Endpoint:

    nvidia/parakeet-tdt-0.6b-v3

    $0.0015/min


    // Sketch: Parakeet is speech-to-text, so we upload audio for
    // transcription. This posts to Together AI's OpenAI-compatible
    // transcription route; verify the exact path and response fields
    // against the current API reference.
    import { readFileSync } from 'node:fs';

    async function transcribeAudio() {
      const form = new FormData();
      form.append('model', 'nvidia/parakeet-tdt-0.6b-v3');
      form.append(
        'file',
        new Blob([readFileSync('./speech.wav')], { type: 'audio/wav' }),
        'speech.wav',
      );

      const res = await fetch('https://api.together.xyz/v1/audio/transcriptions', {
        method: 'POST',
        headers: { Authorization: `Bearer ${process.env.TOGETHER_API_KEY}` },
        body: form,
      });

      const data = await res.json();
      console.log(data.text); // transcript with punctuation and capitalization
    }

    transcribeAudio();
    
    
  • Model card

    Architecture Overview:
    • FastConformer-TDT (Token-and-Duration Transducer) architecture with 600 million parameters
    • Designed for high-throughput inference—among the highest throughput multilingual models on HuggingFace Open ASR Leaderboard
    • Automatic language detection across EU official languages without prompting
    • Input: 16kHz monochannel audio (.wav, .flac)
    • Output: Text with automatic punctuation, capitalization, and word-level/segment-level timestamps

    Training Methodology:
    • Trained on NVIDIA's Granary dataset: approximately 660,000 hours of pseudo-labeled multilingual audio across EU official languages
    • Fine-tuned on 10,000 hours of human-transcribed data from NeMo ASR Set 3.0 (including LibriSpeech, Fisher Corpus, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice)
    • Initialized from CTC multilingual checkpoint, trained for 150,000 steps on 128 A100 GPUs
    • Unified SentencePiece tokenizer with 8,192 tokens across all supported languages

    Performance Characteristics:
    • 6.34% average WER on HuggingFace Open ASR Leaderboard (English benchmarks)
    • 11.97% average WER on FLEURS multilingual benchmark
    • 7.83% average WER on MLS benchmark
    • Maintains accuracy under noise: 7.12% WER at SNR 10, 8.23% at SNR 5
    • 1.93% WER on LibriSpeech test-clean, 3.59% on test-other
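
The WER figures above are the standard word error rate: word-level edit distance (substitutions, insertions, deletions) divided by the number of words in the reference transcript. A minimal TypeScript sketch of the metric for evaluating transcripts locally (illustrative only, not the leaderboard's evaluation harness, which also normalizes text):

```typescript
// Word error rate: Levenshtein distance over words, divided by
// the number of words in the reference transcript.
function wer(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);

  // dp[i][j] = edit distance between ref[:i] and hyp[:j]
  const dp: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0,
    ),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,     // deletion
        dp[i][j - 1] + 1,     // insertion
        dp[i - 1][j - 1] + sub, // substitution or match
      );
    }
  }
  return dp[ref.length][hyp.length] / ref.length;
}

// One substitution in a five-word reference → 0.2 (20% WER).
console.log(wer('it is a nice day', 'it is a nice play'));
```

By this definition, the model's 1.93% WER on LibriSpeech test-clean means roughly one word error per 52 reference words.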

  • Prompting

    Together AI API Access:
    • Access Parakeet TDT 0.6B v3 via Together AI APIs using the endpoint nvidia/parakeet-tdt-0.6b-v3
    • Authenticate using your Together AI API key in request headers
    • Send 16kHz monochannel audio (.wav, .flac) as input and receive transcribed text with punctuation and timestamps
    • The model automatically detects the language of the input audio—no language specification required
    • Available on both serverless and dedicated infrastructure for production workloads
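
Since the endpoint expects 16kHz mono WAV or FLAC, a quick pre-flight check of a WAV header can catch mismatched audio before upload. A sketch assuming the canonical RIFF layout, where the channel count sits at byte offset 22 and the sample rate at offset 24 (files with extra chunks before "fmt " would need a full chunk walk):

```typescript
// Inspect a canonical WAV (RIFF) header: channel count at byte
// offset 22 (uint16 LE), sample rate at offset 24 (uint32 LE).
function wavFormat(buf: Buffer): { channels: number; sampleRate: number } {
  if (
    buf.toString('ascii', 0, 4) !== 'RIFF' ||
    buf.toString('ascii', 8, 12) !== 'WAVE'
  ) {
    throw new Error('not a WAV file');
  }
  return { channels: buf.readUInt16LE(22), sampleRate: buf.readUInt32LE(24) };
}

// Build a minimal 16 kHz mono header to demonstrate the check; in
// practice you would pass fs.readFileSync('input.wav') instead.
const header = Buffer.alloc(44);
header.write('RIFF', 0, 'ascii');
header.write('WAVE', 8, 'ascii');
header.write('fmt ', 12, 'ascii');
header.writeUInt16LE(1, 22);     // channels: mono
header.writeUInt32LE(16000, 24); // sample rate: 16 kHz

const { channels, sampleRate } = wavFormat(header);
console.log(
  channels === 1 && sampleRate === 16000 ? 'ok for Parakeet' : 'resample first',
);
```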

  • Applications & use cases

    Voice Agent Backends:
    • Speech-to-text for conversational AI and voice agent pipelines
    • Co-located with LLM and TTS on Together AI for unified voice infrastructure
    • Automatic language detection for multilingual voice agents serving EU markets

    Enterprise Transcription:
    • Contact center analytics with multilingual transcription
    • Medical, legal, and financial transcription across EU official languages
    • Earnings calls and compliance monitoring for multinational organizations

    Live Captioning & Accessibility:
    • Real-time multilingual captioning for meetings, webinars, and broadcasts
    • Subtitle generation across EU official languages with automatic punctuation
    • Accessibility compliance for multimedia content
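
Because every transcription carries word-level timestamps, subtitle generation reduces to grouping words into timed cues. A sketch that emits SRT; the `Word` shape here is a hypothetical stand-in for the response payload, so adapt the field names to the actual schema:

```typescript
// Hypothetical word-timestamp shape; match it to the real response.
interface Word { word: string; start: number; end: number } // seconds

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(sec: number): string {
  const ms = Math.round(sec * 1000);
  const pad = (n: number, w = 2) => String(n).padStart(w, '0');
  return (
    `${pad(Math.floor(ms / 3600000))}:` +
    `${pad(Math.floor((ms % 3600000) / 60000))}:` +
    `${pad(Math.floor((ms % 60000) / 1000))},${pad(ms % 1000, 3)}`
  );
}

// Group words into fixed-size cues (here: 7 words per subtitle).
function wordsToSrt(words: Word[], wordsPerCue = 7): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const chunk = words.slice(i, i + wordsPerCue);
    const span = `${toSrtTime(chunk[0].start)} --> ${toSrtTime(chunk[chunk.length - 1].end)}`;
    cues.push(`${cues.length + 1}\n${span}\n${chunk.map((w) => w.word).join(' ')}`);
  }
  return cues.join('\n\n') + '\n';
}

console.log(wordsToSrt([
  { word: 'Hello', start: 0.0, end: 0.4 },
  { word: 'world', start: 0.45, end: 0.9 },
]));
```

A production captioner would also break cues at punctuation and cap line length, but the timestamp-to-cue mapping is the core of it.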

    High-Volume Processing:
    • Batch transcription of large audio archives across multiple languages
    • Media and content workflows requiring high-throughput multilingual processing

Model details
  • Model provider
    NVIDIA
  • Type
    Audio
  • Deployment
    Serverless
    On-Demand Dedicated
  • Price

    $0.0015 / min

  • Input modalities
    Audio
  • Output modalities
    Text
  • Category
    Audio