Orpheus TTS API
Human-level speech generation with natural emotion and intonation

This model is not currently supported on Together AI.
Visit our Models page to view all the latest models.
Model details
Architecture Overview:
• Llama-3B backbone architecture adapted for speech-LLM applications
• Trained on 100k+ hours of English speech data and billions of text tokens
• SNAC audio tokenizer producing 7 tokens per frame, decoded as a flattened sequence (see the sketch after this list)
• CNN-based detokenizer with a sliding-window modification for streaming playback without popping
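To make the token layout concrete, here is a small illustrative sketch of de-flattening a token stream back into SNAC's three codebook layers. The 1 + 2 + 4 split per 7-token frame matches SNAC's hierarchy, but the exact interleaving order is an assumption for illustration, not the documented Orpheus mapping:

```python
# Illustrative sketch only: split a flat Orpheus token stream into SNAC's
# three codebook layers. The 1/2/4 split per 7-token frame follows SNAC's
# hierarchy; the exact interleaving order is assumed.
def deflatten(tokens):
    coarse, medium, fine = [], [], []
    for i in range(0, len(tokens) - len(tokens) % 7, 7):
        frame = tokens[i:i + 7]
        coarse.append(frame[0])    # layer 1: 1 token per frame
        medium.extend(frame[1:3])  # layer 2: 2 tokens per frame
        fine.extend(frame[3:7])    # layer 3: 4 tokens per frame
    return coarse, medium, fine

print(deflatten(list(range(14))))
# -> ([0, 7], [1, 2, 8, 9], [3, 4, 5, 6, 10, 11, 12, 13])
```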
Training Methodology:
• Pretrained on large-scale speech and text data to maintain language understanding
• Text token training boosts TTS performance while preserving semantic reasoning ability
• Trained exclusively on permissive/non-copyrighted audio data
• Fine-tuned models available for production use with 8 distinct voices (tara, leah, jess, leo, dan, mia, zac, zoe)
• Supports custom fine-tuning with as few as 50 examples per speaker
Performance Characteristics:
• Handles disfluencies naturally without artifacts
• Streaming inference runs faster than real-time playback on an A100 40GB for the 3B-parameter model
• vLLM implementation enables efficient GPU utilization
• Supports real-time streaming at ~200 ms latency, reducible to ~25-50 ms with input streaming (see the latency probe after this list)
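As a rough local check of that latency figure, the sketch below measures time-to-first-chunk. It assumes the orpheus-speech package's OrpheusModel / generate_speech streaming interface as shown in its README; names and arguments may differ between versions:

```python
# Latency probe, assuming the orpheus-speech package API from its README.
import time

from orpheus_tts import OrpheusModel

model = OrpheusModel(model_name="canopylabs/orpheus-tts-0.1-finetune-prod")

start = time.monotonic()
for chunk in model.generate_speech(prompt="tara: Hey, how are you today?", voice="tara"):
    # The first yielded chunk approximates time-to-first-audio.
    print(f"time to first audio chunk: {time.monotonic() - start:.3f}s")
    break
```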
Prompting Orpheus TTS
API Integration:
• Simple Python package installation via pip install orpheus-speech
• Built on vLLM for fast inference with standard LLM generation arguments
• Supports streaming and non-streaming modes for flexible deployment (see the example after this list)
• Compatible with Baseten for optimized fp8 and fp16 inference
• Available through multiple integration options including OpenAI-compatible APIs
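A minimal end-to-end example, following the usage pattern in the orpheus-speech README (the model name and generate_speech signature reflect that README and may change between releases):

```python
import wave

from orpheus_tts import OrpheusModel

model = OrpheusModel(model_name="canopylabs/orpheus-tts-0.1-finetune-prod")

# generate_speech yields raw 16-bit mono PCM chunks as they are decoded,
# so audio can be played (or saved) while generation is still running.
syn_tokens = model.generate_speech(
    prompt="tara: Hi! I'm Tara, and this is a quick streaming demo.",
    voice="tara",
)

with wave.open("output.wav", "wb") as wf:
    wf.setnchannels(1)      # mono
    wf.setsampwidth(2)      # 16-bit PCM
    wf.setframerate(24000)  # SNAC operates on 24 kHz audio
    for audio_chunk in syn_tokens:
        wf.writeframes(audio_chunk)
```

For non-streaming use, the same chunks can simply be collected into one buffer before writing.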
Prompting Format:
• Fine-tuned model format: {voice_name}: Your text here (voices: tara, leah, jess, leo, dan, mia, zac, zoe)
• Emotion tags: <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>
• Pretrained model supports zero-shot voice cloning via conditioning on text-speech pairs in the prompt
• Standard LLM generation args apply (temperature, top_p); repetition_penalty ≥ 1.1 is required for stable generations (a combined example follows this list)
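Putting those pieces together, a hedged sketch of a fine-tuned-model prompt with an inline emotion tag and the sampling arguments above (how the kwargs are passed depends on your integration; they map onto standard vLLM sampling parameters):

```python
# "{voice_name}: text" format for the fine-tuned models, with an inline
# emotion tag from the documented tag set.
prompt = "leo: That deadline <sigh> is going to be tight, but we'll manage."

# Standard LLM sampling arguments; exactly how they are passed depends on
# the integration (they correspond to vLLM sampling parameters).
sampling = {
    "temperature": 0.6,
    "top_p": 0.9,
    "repetition_penalty": 1.1,  # >= 1.1 is required for stable generations
}
```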
Advanced Techniques:
• Zero-shot voice cloning emerges from large pretraining data without explicit training objective
• Multiple text-speech pairs in the prompt improve voice cloning reliability (sketched after this list)
• Increasing repetition_penalty and temperature makes the model speak faster
• Multilingual models available in research preview (7 language pairs)
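Because cloning is prompt-conditioned rather than a separate API, one way to picture it is as prompt assembly: reference (transcript, audio-token) pairs placed before the target text. This is a conceptual sketch only; the exact interleaving and any special tokens Orpheus expects are assumptions here, and the checkpoint name follows the published pretrained model:

```python
# Conceptual sketch only: condition the pretrained model on reference
# (transcript, SNAC-token) pairs placed in the prompt before the target text.
# The exact interleaving and any special tokens are assumptions.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("canopylabs/orpheus-3b-0.1-pretrained")

def build_cloning_prompt(reference_pairs, target_text):
    """reference_pairs: list of (transcript, snac_token_ids) tuples."""
    ids = []
    for transcript, audio_tokens in reference_pairs:
        ids.extend(tok.encode(transcript, add_special_tokens=False))
        ids.extend(audio_tokens)  # reference audio tokens as in-context examples
    ids.extend(tok.encode(target_text, add_special_tokens=False))
    return ids
```

Per the note above, supplying several reference pairs tends to make the cloned voice more reliable.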
Optimization Strategies:
• Real-time output streaming with very low (~200 ms) latency
• Streaming input into the KV cache reduces latency to ~25-50 ms
• Simple fine-tuning process, analogous to LLM tuning with Trainer and Transformers
• LoRA fine-tuning support for efficient adaptation (see the sketch after this list)
• Custom dataset preparation via the Hugging Face datasets format
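A hedged sketch of the LoRA route using the standard Transformers + PEFT stack. The dataset name is a placeholder, and the dataset-specific step of packing text and SNAC audio tokens into input_ids is assumed to have been done already:

```python
# LoRA fine-tuning sketch with Transformers + PEFT. Assumes a dataset already
# preprocessed into input_ids (text interleaved with SNAC audio tokens).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "canopylabs/orpheus-3b-0.1-pretrained"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))

train = load_dataset("your-org/your-tts-dataset", split="train")  # placeholder

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="orpheus-lora",
        per_device_train_batch_size=1,
        num_train_epochs=3,
        learning_rate=2e-4,
    ),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```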
Applications & Use Cases
Conversational AI & Virtual Assistants:
• Low-latency streaming enables natural conversational experiences
• Emotional intelligence and empathy expression for human-like interactions
• Multiple voice options for personalized assistant experiences
• Handles natural disfluencies and conversational patterns
Voice Cloning & Customization:
• Zero-shot voice cloning without prior fine-tuning
• Custom voice creation with 50+ training examples for high-quality results
• Production-ready fine-tuned models with 8 distinct voices
• Sample fine-tuning scripts provided for easy customization
Content Creation & Media:
• Audiobook narration with natural emotion and intonation
• Podcast generation with multiple speaker voices
• Video voiceovers with guided emotion control
• Character voices for gaming and animation
Enterprise & Production Applications:
• Contact center automation with empathetic customer service voices
• E-learning and training content with engaging narration
• Accessibility applications for text-to-speech needs
• Real-time translation and dubbing services
Creative Applications:
• Guided emotion and intonation for dramatic readings
• Role-playing and character voice generation
• Music and audio production with vocal synthesis
• Interactive storytelling with dynamic voice expressions
