Mist v3 Omni
Multilingual TTS with deterministic pronunciation across four languages
About model
Rime Mist v3 Omni is the multilingual variant of Rime's Mist v3 text-to-speech model, supporting English, Spanish, French, and German with deterministic pronunciation control across all four languages. Built on the same phoneme-first architecture as Mist v3, it delivers consistent pronunciation of brand names, medical terms, and domain vocabulary across languages without model retraining. The model features an updated inference engine for high-throughput concurrent requests, supports SSML, and consolidates multilingual TTS into a single model.
4 languages: English, Spanish, French, and German with deterministic pronunciation
Deterministic: Same input produces same phonetic output across all languages
SSML: Controllable pauses and inline speed adjustment
- Multilingual Deterministic Pronunciation: Phoneme-first architecture ensuring consistent pronunciation across English, Spanish, French, and German
- Single Model Multilingual: Consolidate four-language TTS into one model without language-specific routing or separate infrastructure
- Pronunciation Control: Correct brand names, medical terms, and domain vocabulary across all languages in minutes without model retraining
- SSML Support: Controllable pauses and inline speed adjustment for fine-grained voice output control in all supported languages
Model card
Architecture Overview:
• Multilingual variant of the Mist v3 inference engine supporting English, Spanish, French, and German
• Phoneme-first architecture delivering deterministic pronunciation across all four languages
• Optimized for high-throughput concurrent requests in multilingual production environments
• SSML features including controllable pauses and inline speed adjustment
• Compatible with the same voice catalog, with multilingual coverage
Training Methodology:
• Trained on real conversational data across English, Spanish, French, and German
• Optimized for natural prosody and pronunciation accuracy in each supported language
• Robust text normalization layer providing deterministic control over domain-specific vocabulary across languages
Performance Characteristics:
• Deterministic pronunciation across all four languages: define a word once and it renders consistently
• High-throughput concurrent request handling for multilingual enterprise deployments
• Pronunciation corrections applied in minutes without model retraining across any supported language
• SSML support for fine-grained control over pauses, pacing, and speed in all languages
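As an illustration of the pause and speed controls listed above, here is a minimal sketch that assembles an SSML string in Python. The page does not say which SSML tags Mist v3 Omni accepts, so the `break` and `prosody` elements below are taken from the W3C SSML specification as an assumption, and the helper names (`plain`, `pause`, `speed`, `speak`) are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def plain(text: str) -> str:
    """Escape characters that are special in XML (&, <, >)."""
    return escape(text)


def pause(milliseconds: int) -> str:
    """W3C SSML pause element. Assumption: the model accepts <break>."""
    return f'<break time="{milliseconds}ms"/>'


def speed(text: str, rate: str) -> str:
    """W3C SSML inline rate change. Assumption: the model accepts <prosody>."""
    return f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"


def speak(*parts: str) -> str:
    """Wrap already-escaped parts in the SSML root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    plain("Your confirmation code is"),
    pause(300),
    speed("4 7 2 9", "slow"),
)
```

Check the Rime and Together AI documentation for the tags and attribute values that are actually supported before relying on this markup.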
Prompting
Together AI API Access:
• Access Rime Mist v3 Omni via Together AI APIs using the endpoint rime-labs/rime-mist-v3-omni
• Authenticate using your Together AI API key in request headers
• Supports English, Spanish, French, and German with language selection via API
• Use custom pronunciation by wrapping words in curly brackets with phonemizeBetweenBrackets enabled
• SSML support for controllable pauses and inline speed adjustment
• Available on Together AI dedicated infrastructure co-located with LLM and STT workloads
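The access steps above can be sketched as a request builder in Python. Only the model string `rime-labs/rime-mist-v3-omni`, the API-key header requirement, and the `phonemizeBetweenBrackets` option come from this page. The URL, the `input`, `voice`, and `language` field names, the placement of `phonemizeBetweenBrackets` in the request body, and the helper names are assumptions to verify against the Together AI API reference.

```python
import json
import urllib.request

# Assumption: Together AI's text-to-speech route; confirm in the API reference.
API_URL = "https://api.together.xyz/v1/audio/speech"
MODEL = "rime-labs/rime-mist-v3-omni"  # model string given on this page


def with_custom_pronunciation(text: str, word: str, phonemes: str) -> str:
    """Replace `word` with a phoneme spelling wrapped in curly brackets.

    The phoneme alphabet is defined by Rime; `phonemes` is supplied by the caller.
    """
    return text.replace(word, "{" + phonemes + "}")


def build_speech_request(api_key, text, language, voice, phonemize=False):
    """Build (but do not send) a POST request for speech synthesis."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": MODEL,
        "input": text,         # assumed field name
        "voice": voice,        # assumed field name
        "language": language,  # assumed field name and value format
    }
    if phonemize:
        # Option named on this page; its position in the body is assumed.
        payload["phonemizeBetweenBrackets"] = True
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


# To send the request and save the audio:
#   with urllib.request.urlopen(request) as response:
#       audio_bytes = response.read()
```

Keeping the builder separate from the network call makes it possible to inspect the payload before any characters are billed.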
Applications & use cases
Global Contact Centers:
• Multilingual voice agent deployments across English, Spanish, French, and German markets
• Consistent pronunciation control across all four languages for brand and product terms
• Single model consolidating multilingual TTS infrastructure without language-specific routing
Healthcare Voice Agents:
• Medical terminology pronounced correctly across all supported languages
• Co-located with LLM and STT on Together AI HIPAA-ready infrastructure
• Deterministic pronunciation for patient-facing interactions in multilingual healthcare environments
Financial Services & Retail:
• Account identifiers, product names, and financial terms read consistently across languages
• Compliance-grade multilingual voice output on SOC 2 and PCI compliant infrastructure
• Serve European and Latin American markets from a single model and voice pipeline
- Model provider: Rime
- Type: Audio
- Main use cases: Text-to-Speech
- Deployment: On-Demand Dedicated
- Price: $10 / 1M characters + GPU hourly (by hardware)
- Input modalities: Text
- Output modalities: Audio
- Category: Audio
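The pricing above combines a per-character charge with a GPU hourly charge. A rough estimate can be sketched as follows. The $10 per 1M characters figure is from this page. The GPU hourly rate depends on hardware and is not listed here, so it is passed in as a parameter, and the function name is hypothetical.

```python
CHAR_PRICE_PER_MILLION_USD = 10.00  # from the pricing listed on this page


def estimate_cost_usd(characters: int, gpu_hours: float, gpu_hourly_rate_usd: float) -> float:
    """Estimate total cost: character charge plus GPU time.

    gpu_hourly_rate_usd is not given on this page; look it up for your hardware.
    """
    character_cost = characters / 1_000_000 * CHAR_PRICE_PER_MILLION_USD
    gpu_cost = gpu_hours * gpu_hourly_rate_usd
    return character_cost + gpu_cost
```

For example, 2.5 million characters cost $25 in character charges before any GPU time is added.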