Mist v3

English TTS with deterministic pronunciation for enterprise voice agents

About the model

Rime Mist v3 is Rime's English text-to-speech model. Its phoneme-first architecture gives deterministic pronunciation control: the same input always produces the same phonetic output across all voices and calls, and pronunciation corrections take effect in minutes without model retraining. Mist v3 also adds an updated inference engine optimized for high-throughput concurrent requests, SSML support for controllable pauses and speed adjustment, and full backward compatibility with Mist v2 voices.

Cupola

Professional Health

"Okay, cool, cool, gotcha. So first things first, let's get your date of birth and then we can get you set right up with an appointment."

Vespera

Casual Finance

"Oh, yeah, believe me, I definitely understand how daunting this finance stuff can be. But, you know, I'm here for you and we'll work through it together."

Eliphas

Calm Telecom

"Okay, so now the modem should be showing a blinking yellow light. Is that what you're seeing?"

Pronunciation

Deterministic

Same input produces same phonetic output every time

Language

English

Optimized for enterprise voice deployments

Control

SSML

Controllable pauses and inline speed adjustment

Model key capabilities
  • Deterministic Pronunciation: Phoneme-first architecture ensuring the same input always produces the same phonetic output across all voices and calls
  • Pronunciation Control: Correct brand names, medical terms, and domain vocabulary in minutes without model retraining via custom phoneme mapping
  • High Throughput: Updated inference engine optimized for concurrent request handling at enterprise contact center volumes
  • SSML Support: Controllable pauses and inline speed adjustment for fine-grained voice output control
  • Model card

    Architecture Overview:
    • Updated inference engine for the Mist text-to-speech model, optimized for high-throughput concurrent requests
    • Phoneme-first architecture delivering deterministic pronunciation: the same input always produces the same phonetic output
    • English language support
    • SSML features including controllable pauses and inline speed adjustment
    • Same voice catalog as Mist v2 with full backwards compatibility

    Training Methodology:
    • Trained on real customer service conversations for natural pacing, rhythm, and conversational cadence
    • Optimized for clarity and pronunciation accuracy in production voice environments
    • Robust text normalization layer providing deterministic control over brand names, medical terms, and domain-specific vocabulary

    Performance Characteristics:
    • Deterministic pronunciation: define a word once and it renders consistently across all voices and calls
    • High-throughput concurrent request handling for enterprise contact center volumes
    • Pronunciation corrections applied in minutes without model retraining
    • SSML support for fine-grained control over pauses, pacing, and speed
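
    The "define a word once" idea can be illustrated with a small client-side sketch. This is not Rime's internal mechanism: it assumes a hypothetical lexicon that maps terms to phonetic spellings and rewrites input text so each term is replaced by its spelling in curly brackets before the text is sent for synthesis. The lexicon entries, the placeholder spellings, and the function name `apply_lexicon` are all illustrative.

```python
import re

# Hypothetical lexicon: term -> phonetic spelling.
# The values are placeholders, not real Rime phoneme strings.
LEXICON = {
    "Acme": "PHONEMES_FOR_ACME",
    "Acme Plus": "PHONEMES_FOR_ACME_PLUS",
}


def apply_lexicon(text: str, lexicon: dict[str, str]) -> str:
    """Replace each known term with its phonetic spelling in curly brackets.

    Matching is case-insensitive and whole-word. Longer terms are tried
    first so "Acme Plus" is not split by the shorter entry "Acme".
    """
    if not lexicon:
        return text
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): spelling for term, spelling in lexicon.items()}
    return pattern.sub(lambda m: "{" + lookup[m.group(0).lower()] + "}", text)
```

    Because the rewrite is a pure function of the text and the lexicon, the same sentence always yields the same bracketed output, which mirrors the consistency property described above. Terms that begin or end with punctuation would need a different boundary rule than `\b`.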

  • Prompting

    Together AI API Access:
    • Access Rime Mist v3 via Together AI APIs using the endpoint rime-labs/rime-mist-v3
    • Authenticate using your Together AI API key in request headers
    • Apply custom pronunciations by wrapping the target word's phonetic spelling in curly brackets, with phonemizeBetweenBrackets enabled
    • SSML support for controllable pauses and inline speed adjustment
    • Available on Together AI dedicated infrastructure co-located with LLM and STT workloads
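
    The access steps above might be sketched as follows, using only the Python standard library. Only the model string `rime-labs/rime-mist-v3`, API-key authentication, and the `phonemizeBetweenBrackets` option come from this page. The request URL, the `input` and `voice` field names, the bearer-token header format, and passing `phonemizeBetweenBrackets` as a top-level body field are assumptions that should be checked against the Together AI API reference.

```python
import json
import urllib.request

# Assumption: speech synthesis route; confirm against the API reference.
API_URL = "https://api.together.xyz/v1/audio/speech"
MODEL = "rime-labs/rime-mist-v3"  # model string given on this page


def build_request(text: str, voice: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a synthesis request."""
    payload = {
        "model": MODEL,
        "input": text,   # assumed field name for the text to speak
        "voice": voice,  # assumed field name for the voice
        # Option named on this page; its placement in the body is an assumption.
        "phonemizeBetweenBrackets": True,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed header format
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text: str, voice: str, api_key: str) -> bytes:
    """Send the request and return the raw response body (audio bytes)."""
    with urllib.request.urlopen(build_request(text, voice, api_key)) as resp:
        return resp.read()
```

    A call such as `synthesize("Welcome to {PHONEMES}.", "cupola", api_key)` would then request audio, where `{PHONEMES}` stands in for a real phonetic spelling and `cupola` is an assumed voice identifier based on the sample voices listed earlier. Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.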

  • Applications & use cases

    Enterprise Contact Centers:
    • High-volume voice agent deployments requiring consistent pronunciation across millions of calls
    • Customer support automation with natural conversational cadence
    • IVR modernization with deterministic pronunciation control

    Healthcare Voice Agents:
    • Medication names, medical terms, and clinical vocabulary pronounced correctly every time
    • Co-located with LLM and STT on Together AI HIPAA-ready infrastructure
    • Deterministic pronunciation reduces mispronunciation risk in patient interactions: a corrected term renders the same way on every call

    Financial Services:
    • Account numbers, routing numbers, and financial product names read clearly and consistently
    • Compliance-grade voice output on SOC 2 and PCI-compliant infrastructure
    • Brand name and proprietary term pronunciation locked across all voice channels

Related models
  • Model provider
    Rime
  • Type
    Audio
  • Main use cases
    Text-to-Speech
  • Deployment
    On-Demand Dedicated
  • Price

    $10 / 1M characters + GPU hourly (by hardware)

  • Input modalities
    Text
  • Output modalities
    Audio
  • Category
    Audio
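
The listed price has two parts: a per-character charge and a GPU hourly charge that depends on hardware. A rough cost estimate can be sketched as below. The $10 per 1M characters figure comes from the price above; the GPU hourly rate is not stated on this page, so it is left as a caller-supplied parameter, and the function name `estimate_cost` is illustrative.

```python
from decimal import Decimal

# USD per one million characters, from the listed price.
PRICE_PER_MILLION_CHARS = Decimal("10")


def estimate_cost(characters: int, gpu_hours: Decimal, gpu_hourly_rate: Decimal) -> Decimal:
    """Estimate total cost in USD.

    gpu_hourly_rate is hardware-dependent and not given on this page;
    the caller must supply the rate for their chosen hardware.
    """
    character_cost = PRICE_PER_MILLION_CHARS * Decimal(characters) / Decimal(1_000_000)
    gpu_cost = gpu_hours * gpu_hourly_rate
    return character_cost + gpu_cost
```

For example, 2.5 million characters come to $25 in character charges, before any GPU time is added. Decimal arithmetic is used so that currency amounts are not affected by binary floating-point rounding.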