Orpheus TTS
Human-level speech generation with natural emotion and intonation

About model
Orpheus TTS is a family of speech LLMs built on a Llama-3B backbone that generates human-level speech with natural emotion and intonation. Trained on more than 100k hours of English speech data, Orpheus demonstrates that open-source TTS can compete with, and surpass, closed-source models in real-world quality.
Model card
Architecture Overview:
• Llama-3B backbone architecture adapted for speech-LLM applications
• Trained on 100k+ hours of English speech data and billions of text tokens
• SNAC audio tokenizer producing 7 tokens per frame, decoded as a single flattened sequence
• CNN-based detokenizer with a sliding-window modification that allows streaming without popping artifacts
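The frame layout and sliding-window idea above can be illustrated with a minimal sketch. It only shows the bookkeeping: grouping a flattened token stream into 7-token frames and iterating over overlapping windows of frames, so that each decode step sees some preceding context. The function names and the window size of 4 frames are assumptions made for illustration. This page does not specify them, and the actual audio decoding step is not shown.

```python
from typing import Iterator, List, Sequence

TOKENS_PER_FRAME = 7  # from the model card: 7 tokens per frame


def split_frames(flat: Sequence[int]) -> List[List[int]]:
    """Group a flattened token sequence into complete 7-token frames."""
    # Drop a trailing partial frame, which can occur mid-stream.
    usable = len(flat) - len(flat) % TOKENS_PER_FRAME
    return [list(flat[i:i + TOKENS_PER_FRAME])
            for i in range(0, usable, TOKENS_PER_FRAME)]


def sliding_windows(frames: List[List[int]], window: int = 4) -> Iterator[List[List[int]]]:
    """Yield overlapping windows of `window` consecutive frames.

    The window advances one frame at a time. The window size is a
    hypothetical value chosen for this sketch.
    """
    for end in range(window, len(frames) + 1):
        yield frames[end - window:end]


if __name__ == "__main__":
    tokens = list(range(37))          # 5 complete frames plus 2 leftover tokens
    frames = split_frames(tokens)
    for w in sliding_windows(frames):
        print([frame[0] for frame in w])  # first token of each frame in the window
```

In a streaming setup, each window would be passed to the detokenizer and only part of the resulting audio kept, so that consecutive chunks join without discontinuities. How much audio to keep per window depends on the detokenizer and is not covered here.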
Training Methodology:
• Pretrained on large-scale speech and text data to maintain language understanding
• Text token training boosts TTS performance while preserving semantic reasoning ability
• Trained exclusively on permissive/non-copyrighted audio data
• Fine-tuned models available for production use with 8 distinct voices (tara, leah, jess, leo, dan, mia, zac, zoe)
• Supports custom fine-tuning with as few as 50 examples per speaker
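As a small illustration of working with the eight fine-tuned voices listed above, the sketch below validates a voice name before a request is built. The helper name `build_prompt` and the `"voice: text"` prompt layout are assumptions for this example. This page does not document how a voice is selected, so check the provider's API reference for the actual mechanism.

```python
# Voice names as listed in the model card.
VOICES = ("tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe")


def build_prompt(text: str, voice: str = "tara") -> str:
    """Return a prompt string prefixed with a validated voice name.

    The "voice: text" layout is an assumption made for illustration.
    """
    name = voice.strip().lower()
    if name not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {', '.join(VOICES)}")
    return f"{name}: {text.strip()}"


if __name__ == "__main__":
    print(build_prompt("Thanks for calling. How can I help?", voice="Leah"))
```

Validating the name on the client side gives a clear error early instead of a failed or mis-voiced request.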
Performance Characteristics:
• Handles disfluencies naturally without artifacts
• Streaming inference runs faster than real-time playback on an A100 40GB for the 3B-parameter model
• vLLM implementation enables efficient GPU utilization
Applications & use cases
Conversational AI & Virtual Assistants:
• Low-latency streaming enables natural conversational experiences
• Emotional intelligence and empathy expression for human-like interactions
• Multiple voice options for personalized assistant experiences
• Handles natural disfluencies and conversational patterns
Content Creation & Media:
• Audiobook narration with natural emotion and intonation
• Podcast generation with multiple speaker voices
• Video voiceovers with guided emotion control
• Character voices for gaming and animation
Enterprise & Production Applications:
• Contact center automation with empathetic customer service voices
• E-learning and training content with engaging narration
• Accessibility applications for text-to-speech needs
• Real-time translation and dubbing services
Creative Applications:
• Guided emotion and intonation for dramatic readings
• Role-playing and character voice generation
• Music and audio production with vocal synthesis
• Interactive storytelling with dynamic voice expressions
- Model provider: Canopy Labs
- Type: Audio
- Main use cases: Small & Fast, Text-to-Speech
- Deployment: Serverless
- Parameters: 3B
- Price: $0.27 / 1M characters
- Price: $0.85 / 1M characters
- Input modalities: Text
- Output modalities: Audio
- Released: March 16, 2025
- Last updated: November 2, 2025
- Category: Audio
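Because pricing is quoted per 1M characters, the cost of a given text can be estimated with simple arithmetic. The sketch below takes the rate as a parameter, since the listing shows two prices ($0.27 and $0.85) without saying what distinguishes them. The function name is hypothetical, and counting characters with Python's `len()` is an assumption about how billing measures input length.

```python
def estimate_cost_usd(text: str, price_per_million_chars: float) -> float:
    """Estimate synthesis cost for `text` at a per-1M-character rate."""
    # len() counts Unicode code points; actual billing may count differently.
    return len(text) / 1_000_000 * price_per_million_chars


if __name__ == "__main__":
    script = "Welcome back. Let's pick up where we left off. " * 1000
    for rate in (0.27, 0.85):  # the two rates shown in the listing
        print(f"{len(script)} chars at ${rate}/1M: ${estimate_cost_usd(script, rate):.4f}")
```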