Mist v3
English TTS with deterministic pronunciation for enterprise voice agents
About model
Rime Mist v3 is Rime's English text-to-speech model with a phoneme-first architecture delivering deterministic pronunciation control. The same input always produces the same phonetic output across all voices and calls, with pronunciation corrections applied in minutes without model retraining. Mist v3 features an updated inference engine optimized for high-throughput concurrent requests, SSML support for controllable pauses and speed adjustment, and full backwards compatibility with Mist v2 voices.
Deterministic: Same input produces same phonetic output every time
English: Optimized for enterprise voice deployments
SSML: Controllable pauses and inline speed adjustment
- Deterministic Pronunciation: Phoneme-first architecture ensuring the same input always produces the same phonetic output across all voices and calls
- Pronunciation Control: Correct brand names, medical terms, and domain vocabulary in minutes without model retraining via custom phoneme mapping
- High Throughput: Updated inference engine optimized for concurrent request handling at enterprise contact center volumes
- SSML Support: Controllable pauses and inline speed adjustment for fine-grained voice output control
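As a rough illustration of custom phoneme mapping, the sketch below rewrites text on the client side so that each listed term is replaced by a bracketed phonetic spelling, using the curly-bracket syntax described under Prompting. The helper name `apply_lexicon` and the lexicon entries are hypothetical, and the phoneme strings are placeholders rather than verified Rime phonetic spellings.

```python
import re


def apply_lexicon(text: str, lexicon: dict[str, str]) -> str:
    """Replace each whole-word lexicon entry with its bracketed phoneme string.

    The mapping is purely textual, so the same input and lexicon always
    yield the same output string.
    """
    if not lexicon:
        return text
    # Longest keys first so longer entries win over shorter ones they contain.
    keys = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: "{" + lexicon[m.group(1)] + "}", text)


# Placeholder phoneme strings (assumption): substitute real spellings
# taken from Rime's pronunciation documentation.
EXAMPLE_LEXICON = {
    "Acme": "PHONEMES_FOR_ACME",
    "Zyrtec": "PHONEMES_FOR_ZYRTEC",
}

rewritten = apply_lexicon("Thanks for calling Acme.", EXAMPLE_LEXICON)
```

Keeping the lexicon in one place means a correction is a data change rather than a model change, which matches the "corrected in minutes" claim above.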
Model card
Architecture Overview:
• Updated inference engine for the Mist text-to-speech model, optimized for high-throughput concurrent requests
• Phoneme-first architecture delivering deterministic pronunciation: the same input always produces the same phonetic output
• English language support
• SSML features including controllable pauses and inline speed adjustment
• Same voice catalog as Mist v2 with full backwards compatibility
Training Methodology:
• Trained on real customer service conversations for natural pacing, rhythm, and conversational cadence
• Optimized for clarity and pronunciation accuracy in production voice environments
• Robust text normalization layer providing deterministic control over brand names, medical terms, and domain-specific vocabulary
Performance Characteristics:
• Deterministic pronunciation: define a word once and it renders consistently across all voices and calls
• High-throughput concurrent request handling for enterprise contact center volumes
• Pronunciation corrections applied in minutes without model retraining
• SSML support for fine-grained control over pauses, pacing, and speed
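To make the pause and speed controls concrete, here is a minimal sketch that assembles an SSML string from small helpers. It uses the standard W3C SSML elements `<break>` and `<prosody>`. Whether Mist v3 accepts exactly these tags and attribute values is an assumption to check against Rime's documentation, and the helper names are hypothetical.

```python
from xml.sax.saxutils import escape


def plain(text: str) -> str:
    """Escape &, <, and > so literal text cannot break the markup."""
    return escape(text)


def pause(ms: int) -> str:
    """A pause of the given length, as a W3C SSML <break> element."""
    if ms < 0:
        raise ValueError("pause length must be non-negative")
    return f'<break time="{ms}ms"/>'


def at_rate(text: str, percent: int) -> str:
    """Wrap text in a W3C SSML <prosody> element with a relative rate."""
    if percent <= 0:
        raise ValueError("rate must be a positive percentage")
    return f'<prosody rate="{percent}%">{escape(text)}</prosody>'


def speak(*parts: str) -> str:
    """Join already-built fragments inside a single <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    plain("Your confirmation code is"),
    pause(300),
    at_rate("4 7 2 9", 80),
)
```

Building the markup through helpers rather than string concatenation at call sites keeps escaping in one place, which matters when the spoken text comes from user data.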
Prompting
Together AI API Access:
• Access Rime Mist v3 via Together AI APIs using the endpoint rime-labs/rime-mist-v3
• Authenticate using your Together AI API key in request headers
• Specify custom pronunciations by wrapping a word's phonetic spelling in curly brackets and enabling phonemizeBetweenBrackets
• SSML support for controllable pauses and inline speed adjustment
• Available on Together AI dedicated infrastructure co-located with LLM and STT workloads
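The access steps above can be sketched as a single HTTP request. This is a minimal example under stated assumptions, not a confirmed API contract: the route `https://api.together.xyz/v1/audio/speech`, the body fields `input` and `voice`, passing `phonemizeBetweenBrackets` as a top-level body field, and the voice name `example_voice` are all assumptions to verify against the Together AI API reference. Only the identifier `rime-labs/rime-mist-v3` and the use of an API key in the headers come from the text above.

```python
import json
import urllib.request

# Assumed text-to-speech route; confirm in the Together AI API reference.
API_URL = "https://api.together.xyz/v1/audio/speech"
MODEL = "rime-labs/rime-mist-v3"  # identifier given in this document


def build_request(text: str, voice: str, api_key: str,
                  phonemize: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request for speech synthesis."""
    body = {"model": MODEL, "input": text, "voice": voice}  # field names assumed
    if phonemize:
        # Assumed placement of the flag that enables {curly-bracket} phonemes.
        body["phonemizeBetweenBrackets"] = True
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text: str, voice: str, api_key: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_request(text, voice, api_key)
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
```

A call might look like `synthesize("Hello from Mist v3.", "example_voice", os.environ["TOGETHER_API_KEY"], "hello.audio")` after `import os`. Reading the key from an environment variable keeps it out of source control.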
Applications & use cases
Enterprise Contact Centers:
• High-volume voice agent deployments requiring consistent pronunciation across millions of calls
• Customer support automation with natural conversational cadence
• IVR modernization with deterministic pronunciation control
Healthcare Voice Agents:
• Medication names, medical terms, and clinical vocabulary pronounced correctly every time
• Co-located with LLM and STT on Together AI HIPAA-ready infrastructure
• Deterministic pronunciation eliminates mispronunciation risk in patient interactions
Financial Services:
• Account numbers, routing numbers, and financial product names read clearly and consistently
• Compliance-grade voice output on SOC 2 and PCI compliant infrastructure
• Brand name and proprietary term pronunciation locked across all voice channels
- Model provider: Rime
- Type: Audio
- Main use cases: Text-to-Speech
- Deployment: On-Demand Dedicated
- Price: $10 / 1M characters + GPU hourly (by hardware)
- Input modalities: Text
- Output modalities: Audio
- Category: Audio
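The listed price has two parts, a per-character charge and an hourly GPU charge, so a rough estimate can be computed as shown below. The $10 per 1M characters figure is from the listing. The GPU hourly rate depends on hardware and is not stated here, so the value used in the example is purely hypothetical, as is the function name.

```python
from decimal import Decimal

CHAR_RATE_PER_MILLION = Decimal("10")  # USD per 1M characters, from the listing


def estimate_cost(characters: int, gpu_hours, gpu_hourly_rate) -> Decimal:
    """Estimate total USD cost: character charge plus GPU time charge."""
    char_cost = Decimal(characters) / Decimal(1_000_000) * CHAR_RATE_PER_MILLION
    # str() avoids binary floating-point artifacts when callers pass floats.
    gpu_cost = Decimal(str(gpu_hours)) * Decimal(str(gpu_hourly_rate))
    return char_cost + gpu_cost


# Hypothetical scenario: 2.5M characters over 10 GPU hours at an
# assumed (not published here) rate of $3.50 per hour.
example_total = estimate_cost(2_500_000, 10, "3.50")
```

Using `Decimal` instead of floats keeps currency arithmetic exact for billing-style estimates.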