MiniMax Speech 2.8

Enterprise TTS with vocal emotes, voice cloning, and sub-250ms latency for production voice agents on Together AI.

About model

MiniMax Speech 2.8 is an enterprise-grade text-to-speech model. In blind A/B testing with native speakers, it delivers a 60% improvement in prosody and naturalness over Speech-2.6. It introduces Sound Tags, a text-injection system for fine-grained control of vocal emotes such as laughter, breathing patterns, and articulations, alongside high-fidelity voice cloning and sub-250ms end-to-end latency. Deploy it on Together AI dedicated endpoints for reliable, scalable voice infrastructure that runs alongside your LLM and STT workloads and supports 40+ languages.

Persuasive Man

"Perfect. Thank you for booking MiniMax Speech 2.8 Early Testing. Your appointment is scheduled for Wednesday, January 21. The contact number we have on file ends in 9921. We'll send a reminder SMS to this number 24 hours before your testing session. Finally, could you please confirm that all these details are correct?"

Golden

"Hey, it's me. How are ya? I hope you're having an awesome day. We actually had a bit of a crazy launch day yesterday, but I'm recovered and ready to roll. You're listening to this and probably thinking I'm just chatting into a microphone, right? But here's the twist. I'm actually not human. I am the new Speech 2.8 model from MiniMax. Crazy, right? If you listen closely, you can hear how I handle the pacing, the little breaths, and even that casual vibe. Have a great day."

Prosody Improvement

60%

Validated by blind A/B testing with native speakers vs Speech-2.6

End-to-End Latency

Sub-250ms

Real-time synthesis with streaming support

Supported Languages

40+

Including Chinese, English, Arabic, Spanish, French, Japanese, and more

Model key capabilities
  • Sound Tags: Text-injection vocal emote control for laughter, breathing patterns, and laryngeal articulations—enabling natural, expressive voice synthesis without post-processing
  • 60% Prosody Improvement: Blind A/B testing with native speakers shows a 60% reduction in naturalness failures compared to Speech-2.6, shifting from robotic stability to human-like fluency
  • High-Fidelity Voice Cloning: Clone voices with strong accent and tonal similarity, maintaining speaker identity across diverse content types and languages
  • Real-Time Latency: Sub-250ms end-to-end latency with sub-300ms TTFT and full streaming support for AI telephony, customer service, and companionship applications
Quickstart guides
  • Model card

    Architecture Overview:
    • Enterprise TTS model with Sound Tags system for text-injection vocal emote control
    • Supports 40+ languages with native speaker-validated prosody improvements
    • High-fidelity voice cloning with strong accent and tonal preservation
    • Full streaming synthesis support with sub-300ms time-to-first-token
    • End-to-end latency under 250ms for real-time applications

    Sound Tags:
    • Laughter & Amusement: (laughs), (chuckle), (snorts)
    • Respiratory Dynamics: (breath), (inhale), (exhale), (pant), (sighs), (sniffs), (gasps)
    • Laryngeal & Oral Articulations: (clear-throat), (coughs), (groans), (emm), (humming), (lip-smacking), (burps), (hissing), (sneezes), (crying), (whistles)
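
Because Sound Tags are plain text inside the input, a typo such as (laugh) instead of (laughs) is easy to miss. The following is a minimal sketch of a pre-flight check that compares parenthesized single-word tokens against the tags listed above. The helper name find_sound_tags and the exact-match, lowercase rule are assumptions made for illustration; this page does not say how the model treats unrecognized tags.

```python
import re

# Tags copied from the Sound Tags list above.
SOUND_TAGS = {
    "laughs", "chuckle", "snorts",
    "breath", "inhale", "exhale", "pant", "sighs", "sniffs", "gasps",
    "clear-throat", "coughs", "groans", "emm", "humming", "lip-smacking",
    "burps", "hissing", "sneezes", "crying", "whistles",
}

# Matches one lowercase word (hyphens allowed) wrapped in parentheses.
_TAG_PATTERN = re.compile(r"\(([a-z][a-z-]*)\)")


def find_sound_tags(text):
    """Return (known, unknown) lists of parenthesized single-word tokens.

    Hypothetical helper: "known" means the token appears in SOUND_TAGS,
    "unknown" means it looks like a tag but is not in the documented list.
    """
    known, unknown = [], []
    for token in _TAG_PATTERN.findall(text):
        (known if token in SOUND_TAGS else unknown).append(token)
    return known, unknown


if __name__ == "__main__":
    known, unknown = find_sound_tags("Well (sighs) that was close (laugh).")
    print("known:", known)
    print("unknown:", unknown)
```

Note that an ordinary one-word parenthetical such as (maybe) would also be reported as unknown, so the result is best treated as a warning rather than an error.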

    Performance Characteristics:
    • 60% improvement in prosody and naturalness vs Speech-2.6 (blind testing, 60 randomized audio pairs with native speakers)
    • 54.5% win rate vs Speech-2.0 baseline in general listening tests
    • Sub-300ms TTFT in production deployments
    • In blind testing, 59% of naturalness failures were attributed to Speech-2.6; these are reported as resolved in Speech-2.8
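
The latency figures above are reported numbers, and results in your own deployment will depend on network and client overhead. One rough way to check them is to time how long a streamed response takes to deliver its first audio chunk. The sketch below works on any iterable of byte chunks; the function names and the simulated stream are hypothetical and stand in for a real streaming response.

```python
import time


def time_to_first_chunk(chunks, clock=time.perf_counter):
    """Consume an iterable of byte chunks.

    Returns (seconds_until_first_chunk, total_bytes). The first value is
    None when the iterable yields nothing.
    """
    start = clock()
    first_chunk_delay = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_delay is None:
            first_chunk_delay = clock() - start
        total_bytes += len(chunk)
    return first_chunk_delay, total_bytes


def simulated_stream(delay_s=0.05):
    """Stand-in for a real audio stream: waits, then yields two chunks."""
    time.sleep(delay_s)
    yield b"\x00" * 320
    yield b"\x00" * 320


if __name__ == "__main__":
    delay, size = time_to_first_chunk(simulated_stream())
    print(f"first chunk after {delay * 1000:.0f} ms, {size} bytes total")
```

The timer starts when iteration begins, so the measurement is only meaningful if the request is sent lazily on the first read. If the request has already been sent, start timing before sending it instead.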

  • Prompting

    Together AI API Access:
    • Access MiniMax Speech 2.8 via Together AI dedicated endpoints using the endpoint minimax/speech-2.8-turbo
    • Authenticate using your Together AI API key in request headers
    • Insert Sound Tags directly into input text for real-time vocal emote control
    • Supports streaming synthesis for low-latency production deployments
    • Available on Together AI dedicated infrastructure co-located with LLM and STT workloads
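
The access steps above could be sketched as follows, using only the Python standard library. Only the endpoint name minimax/speech-2.8-turbo and the use of an API key in the request headers come from this page. The URL, the body field names (input, voice, stream), the placeholder voice ID, and the assumption that the response body is raw audio bytes are all illustrative guesses and should be checked against the Together AI API reference before use.

```python
import json
import urllib.request

# Assumption: an OpenAI-style speech route. This URL is not given on this page.
ASSUMED_SPEECH_URL = "https://api.together.xyz/v1/audio/speech"

# Endpoint name taken from the access notes above.
MODEL = "minimax/speech-2.8-turbo"


def build_speech_request(text, api_key, voice="HYPOTHETICAL_VOICE_ID", stream=True):
    """Build (but do not send) a POST request for speech synthesis.

    Sound Tags such as (laughs) are passed through unchanged in the text.
    The body field names are assumptions, not a documented schema.
    """
    body = {"model": MODEL, "input": text, "voice": voice, "stream": stream}
    return urllib.request.Request(
        ASSUMED_SPEECH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def iter_audio(request, chunk_size=4096):
    """Send the request and yield the response body in chunks.

    Assumes the response body is raw audio bytes, which may not hold.
    """
    with urllib.request.urlopen(request) as response:
        while chunk := response.read(chunk_size):
            yield chunk


if __name__ == "__main__":
    req = build_speech_request("Hi there (laughs), how can I help?", "YOUR_API_KEY")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Building the request separately from sending it keeps the payload easy to inspect and lets the key be injected from an environment variable rather than hard-coded.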

  • Applications & use cases

    Real-Time Voice Agents:
    • AI customer service and support with sub-250ms end-to-end response times
    • Interactive voice response systems requiring natural, expressive speech
    • Live conversational AI with Sound Tags for breathing, pacing, and vocal emotes

    AI Companionship & NPCs:
    • Emotionally resonant character voices with Sound Tags for laughter, sighs, and gasps
    • High-fidelity voice cloning for consistent character identity across sessions
    • Human-like fluency validated by native speaker blind testing

    Multilingual Voice Applications:
    • Global contact centers with 40+ language support
    • Voice cloning for localized brand voices across markets
    • Content production, audiobooks, and storytelling with expressive narration control

    Enterprise Voice Infrastructure:
    • Voice agent platforms requiring reliable, scalable TTS on dedicated Together AI infrastructure
    • Integration with LLM and STT for end-to-end co-located voice pipelines
    • Production deployments with consistent performance and isolated workloads

Model details
  • Model provider
    MiniMax AI
  • Type
    Audio
  • Main use cases
    Text-to-Speech
  • Deployment
    On-Demand Dedicated
  • Price
    $30 / 1M characters + GPU hourly rate (varies by hardware)
  • Output modalities
    Audio
  • Category
    Audio
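
The Price entry above combines a per-character charge with an hourly GPU charge. As a rough worked example, the sketch below estimates a bill from a character count and GPU hours. The $30 per 1M characters figure is from this page; the GPU hourly rate is not listed here, so it is left as a caller-supplied parameter, and the function name is hypothetical. Whether Sound Tag characters count toward the billed total is not stated.

```python
from decimal import Decimal

# USD per 1,000,000 input characters, from the Price entry above.
PRICE_PER_MILLION_CHARS = Decimal("30")


def estimate_cost(num_chars, gpu_hours=0, gpu_hourly_rate=0):
    """Estimate total cost in USD as a Decimal.

    gpu_hourly_rate is a placeholder: the actual rate depends on the
    hardware chosen and is not given on this page.
    """
    char_cost = PRICE_PER_MILLION_CHARS * Decimal(num_chars) / Decimal(1_000_000)
    gpu_cost = Decimal(str(gpu_hours)) * Decimal(str(gpu_hourly_rate))
    return char_cost + gpu_cost


if __name__ == "__main__":
    # Character charge only: 250,000 characters at $30 per 1M is $7.50.
    print(estimate_cost(250_000))
```

Decimal is used instead of float so that currency amounts are not affected by binary rounding.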