MiniMax Speech 2.8

Enterprise TTS with vocal emotes, voice cloning, and sub-250ms latency for production voice agents on Together AI.

About model

MiniMax Speech 2.8 is an enterprise-grade text-to-speech model. In blind A/B testing with native speakers, it showed a 60% improvement in prosody and naturalness over Speech-2.6. It introduces Sound Tags, a text-injection system for fine-grained control of vocal emotes such as laughter, breathing patterns, and articulations, alongside high-fidelity voice cloning and sub-250ms end-to-end latency. Deploy it on Together AI dedicated endpoints for reliable, scalable voice infrastructure that runs alongside your LLM and STT workloads, with support for 40+ languages.

Persuasive Man

"Perfect. Thank you for booking MiniMax Speech 2.8 Early Testing. Your appointment is scheduled for Wednesday, January 21. The contact number we have on file ends in 9921. We'll send a reminder SMS to this number 24 hours before your testing session. Finally, could you please confirm that all these details are correct?"

Golden

"Hey, it's me. How are ya? I hope you're having an awesome day. We actually had a bit of a crazy launch day yesterday, but I'm recovered and ready to roll. You're listening to this and probably thinking I'm just chatting into a microphone, right? But here's the twist. I'm actually not human. I am the new Speech 2.8 model from MiniMax. Crazy, right? If you listen closely, you can hear how I handle the pacing, the little breaths, and even that casual vibe. Have a great day."

Prosody Improvement

60%

Validated by blind A/B testing with native speakers vs Speech-2.6

End-to-End Latency

<250ms

Real-time synthesis with streaming support

Supported Languages

40+

Including Chinese, English, Arabic, Spanish, French, Japanese, and more

Model key capabilities

Sound Tags: Text-injection vocal emote control for laughter, breathing patterns, and laryngeal and oral articulations, enabling natural, expressive voice synthesis without post-processing
60% Prosody Improvement: Blind A/B testing with native speakers shows a 60% reduction in naturalness failures compared to Speech-2.6, shifting from robotic stability to human-like fluency
High-Fidelity Voice Cloning: Clone voices with strong accent and tonal similarity, maintaining speaker identity across diverse content types and languages
Real-Time Latency: Sub-250ms end-to-end latency, sub-300ms time-to-first-token (TTFT), and full streaming support for AI telephony, customer service, and companionship applications

Model card
Architecture Overview:
• Enterprise TTS model with Sound Tags system for text-injection vocal emote control
• Supports 40+ languages with native speaker-validated prosody improvements
• High-fidelity voice cloning with strong accent and tonal preservation
• Full streaming synthesis support with sub-300ms time-to-first-token
• End-to-end latency under 250ms for real-time applications

Sound Tags:
• Laughter & Amusement: (laughs), (chuckle), (snorts)
• Respiratory Dynamics: (breath), (inhale), (exhale), (pant), (sighs), (sniffs), (gasps)
• Laryngeal & Oral Articulations: (clear-throat), (coughs), (groans), (emm), (humming), (lip-smacking), (burps), (hissing), (sneezes), (crying), (whistles)
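Because Sound Tags are plain text, a script can be checked before it is sent for synthesis. The following is a minimal sketch, not part of any MiniMax or Together AI SDK: the helper names `find_unknown_tags` and `with_tag` are hypothetical, and the tag set is copied from the list above. How the model treats an unrecognized tag is not stated here, so the sketch only reports unknown tags rather than guessing at the outcome.

```python
import re

# Tag names copied from the Sound Tags list above.
SOUND_TAGS = {
    "laughs", "chuckle", "snorts",
    "breath", "inhale", "exhale", "pant", "sighs", "sniffs", "gasps",
    "clear-throat", "coughs", "groans", "emm", "humming", "lip-smacking",
    "burps", "hissing", "sneezes", "crying", "whistles",
}

# Matches a single lowercase word (hyphens allowed) wrapped in parentheses.
_TAG_PATTERN = re.compile(r"\(([a-z][a-z-]*)\)")


def find_unknown_tags(text: str) -> list[str]:
    """Return parenthesized single-word tokens that are not documented Sound Tags."""
    return [name for name in _TAG_PATTERN.findall(text) if name not in SOUND_TAGS]


def with_tag(tag: str, sentence: str) -> str:
    """Prefix a sentence with a documented Sound Tag, rejecting unknown names."""
    if tag not in SOUND_TAGS:
        raise ValueError(f"unknown Sound Tag: {tag!r}")
    return f"({tag}) {sentence}"


script = with_tag("sighs", "It has been a long day.") + " (laughs) But we shipped it."
```

Note that the pattern also flags ordinary one-word parentheticals such as "(optional)", so the result is best treated as a list of candidates to review, not as a hard error.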

Performance Characteristics:
• 60% improvement in prosody and naturalness vs Speech-2.6 (blind testing, 60 randomized audio pairs with native speakers)
• 54.5% win rate vs Speech-2.0 baseline in general listening tests
• Sub-300ms TTFT in production deployments
• 59% of the naturalness failures flagged in blind testing were attributed to Speech-2.6 output; these issues are addressed in Speech-2.8

Prompting
Together AI API Access:
• Access MiniMax Speech 2.8 via Together AI dedicated endpoints using the endpoint minimax/speech-2.8-turbo
• Authenticate using your Together AI API key in request headers
• Insert Sound Tags directly into input text for real-time vocal emote control
• Supports streaming synthesis for low-latency production deployments
• Available on Together AI dedicated infrastructure co-located with LLM and STT workloads
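The access steps above can be sketched as a single HTTP request. This is an illustration under stated assumptions, not a confirmed API reference: the URL, the `voice` and `response_format` fields, and the voice name `example-voice` are assumptions to verify against the Together AI documentation, and the sketch assumes `minimax/speech-2.8-turbo` is passed as the `model` field. The request is built but not sent, so it runs without a live endpoint.

```python
import json
import os
import urllib.request

# Assumption: Together AI's OpenAI-style audio route; verify against the current docs.
API_URL = "https://api.together.xyz/v1/audio/speech"
MODEL = "minimax/speech-2.8-turbo"  # identifier given in the list above


def build_speech_request(text: str, api_key: str, voice: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for speech synthesis."""
    payload = {
        "model": MODEL,
        "input": text,  # Sound Tags go inline in the input text
        "voice": voice,  # hypothetical voice name; use one your endpoint exposes
        "response_format": "mp3",  # assumed field name and value
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # API key in the request headers
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_speech_request(
    "(breath) Thanks for calling. How can I help?",
    api_key=os.environ.get("TOGETHER_API_KEY", "YOUR_API_KEY"),
    voice="example-voice",
)

# To actually send it (requires a running dedicated endpoint and a valid key):
# with urllib.request.urlopen(request) as response:
#     audio_bytes = response.read()
```

Reading the key from an environment variable keeps it out of source control; the `YOUR_API_KEY` fallback is only a placeholder so the sketch runs as written.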

Applications & use cases
Real-Time Voice Agents:
• AI customer service and support with sub-250ms end-to-end response times
• Interactive voice response systems requiring natural, expressive speech
• Live conversational AI with Sound Tags for breathing, pacing, and vocal emotes

AI Companionship & NPCs:
• Emotionally resonant character voices with Sound Tags for laughter, sighs, and gasps
• High-fidelity voice cloning for consistent character identity across sessions
• Human-like fluency validated by native speaker blind testing

Multilingual Voice Applications:
• Global contact centers with 40+ language support
• Voice cloning for localized brand voices across markets
• Content production, audiobooks, and storytelling with expressive narration control

Enterprise Voice Infrastructure:
• Voice agent platforms requiring reliable, scalable TTS on dedicated Together AI infrastructure
• Integration with LLM and STT for end-to-end co-located voice pipelines
• Production deployments with consistent performance and isolated workloads

Model specifications

Model data

Model provider
MiniMax AI
Type
Audio
Main use cases
Text-to-Speech
Deployment
On-Demand Dedicated
Price
$30 / 1M characters + GPU hourly (by hardware)
Output modalities
Audio

Category
Audio
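
The listed price has two parts: a per-character fee and a GPU hourly charge that depends on hardware. A rough cost estimate could be sketched as follows. The function name `estimate_cost` and the $5.00 hourly rate are hypothetical, since no GPU rate is given here, and whether Sound Tag characters count toward billing is not stated.

```python
from decimal import Decimal

PRICE_PER_MILLION_CHARS = Decimal("30")  # USD, from the Price field above


def estimate_cost(char_count: int, gpu_hours: Decimal, gpu_hourly_rate: Decimal) -> Decimal:
    """Estimate total USD cost: per-character fee plus GPU time."""
    character_fee = PRICE_PER_MILLION_CHARS * char_count / Decimal(1_000_000)
    gpu_fee = gpu_hours * gpu_hourly_rate
    return character_fee + gpu_fee


# 250,000 characters over 2 GPU hours at a hypothetical $5.00 per hour.
total = estimate_cost(250_000, Decimal("2"), Decimal("5.00"))
```

With these inputs the character fee is $7.50 and the assumed GPU fee is $10.00, for a total of $17.50. `Decimal` is used instead of floats to avoid rounding drift in currency arithmetic.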
