MiniMax Speech 2.8
Enterprise TTS with vocal emotes, voice cloning, and sub-250ms latency for production voice agents on Together AI.
About model
MiniMax Speech 2.8 is an enterprise-grade text-to-speech model delivering a 60% improvement in prosody and naturalness over Speech-2.6, validated by blind A/B testing with native speakers. It introduces Sound Tags—a text-injection system for fine-grained vocal emote control spanning laughter, breathing patterns, and articulations—alongside high-fidelity voice cloning and sub-250ms end-to-end latency. Deploy on Together AI dedicated endpoints for reliable, scalable voice infrastructure that integrates seamlessly with your LLM and STT workloads across 40+ languages.
60% prosody improvement: validated by blind A/B testing with native speakers vs Speech-2.6
Sub-250ms latency: real-time synthesis with streaming support
40+ languages: including Chinese, English, Arabic, Spanish, French, Japanese, and more
- Sound Tags: Text-injection vocal emote control for laughter, breathing patterns, and laryngeal articulations—enabling natural, expressive voice synthesis without post-processing
- 60% Prosody Improvement: Blind A/B testing with native speakers shows a 60% reduction in naturalness failures compared to Speech-2.6, shifting from robotic stability to human-like fluency
- High-Fidelity Voice Cloning: Clone voices with strong accent and tonal similarity, maintaining speaker identity across diverse content types and languages
- Real-Time Latency: Sub-250ms end-to-end latency with sub-300ms TTFT and full streaming support for AI telephony, customer service, and companionship applications
Model card
Architecture Overview:
• Enterprise TTS model with Sound Tags system for text-injection vocal emote control
• Supports 40+ languages with native speaker-validated prosody improvements
• High-fidelity voice cloning with strong accent and tonal preservation
• Full streaming synthesis support with sub-300ms time-to-first-token
• End-to-end latency under 250ms for real-time applications
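The latency figures above are easiest to check against your own deployment. The following is a minimal sketch, not part of any Together AI or MiniMax SDK: `time_to_first_chunk` is a hypothetical helper that accepts any iterable of audio byte chunks (for example, the body of a streaming response) and reports how long the first non-empty chunk took to arrive.

```python
import time
from typing import Iterable, Optional, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[Optional[float], float, int]:
    """Consume a stream of audio chunks and time it.

    Returns (seconds until the first non-empty chunk or None if no audio
    arrived, total seconds to drain the stream, total bytes received).
    Hypothetical helper for illustration only.
    """
    start = time.perf_counter()
    first: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        # Record the arrival time of the first chunk that carries audio.
        if chunk and first is None:
            first = time.perf_counter() - start
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    return first, total, total_bytes
```

The clock starts when the function is called, so the measurement only includes network and synthesis time if the request is issued lazily by the iterable or immediately before the call.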
Sound Tags:
• Laughter & Amusement: (laughs), (chuckle), (snorts)
• Respiratory Dynamics: (breath), (inhale), (exhale), (pant), (sighs), (sniffs), (gasps)
• Laryngeal & Oral Articulations: (clear-throat), (coughs), (groans), (emm), (humming), (lip-smacking), (burps), (hissing), (sneezes), (crying), (whistles)
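Because Sound Tags are plain parenthesized tokens inside the input text, a typo such as `(giggle)` may simply be read aloud or ignored rather than rendered as a vocal emote. As a rough sketch, the tags listed above can be collected into a set and used to check text before synthesis. The helper names and the regular expression are illustrative assumptions, not part of the model's API, and how the model treats unrecognized tags is not stated here.

```python
import re

# Tags copied from the three categories listed above.
KNOWN_SOUND_TAGS = {
    # Laughter & Amusement
    "laughs", "chuckle", "snorts",
    # Respiratory Dynamics
    "breath", "inhale", "exhale", "pant", "sighs", "sniffs", "gasps",
    # Laryngeal & Oral Articulations
    "clear-throat", "coughs", "groans", "emm", "humming", "lip-smacking",
    "burps", "hissing", "sneezes", "crying", "whistles",
}

# Assumption: a tag is a single lowercase word (hyphens allowed) in parentheses.
_TAG_PATTERN = re.compile(r"\(([a-z-]+)\)")


def find_sound_tags(text: str) -> list:
    """Return every tag-shaped token in the text, in order of appearance."""
    return _TAG_PATTERN.findall(text)


def unknown_sound_tags(text: str) -> list:
    """Return tag-shaped tokens that are not in the documented tag list."""
    return [tag for tag in find_sound_tags(text) if tag not in KNOWN_SOUND_TAGS]
```

For example, `unknown_sound_tags("Well (sighs) that was funny (giggle)")` flags `giggle`. Ordinary single-word parentheticals such as `(optional)` are flagged too, so treat the result as a warning rather than an error.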
Performance Characteristics:
• 60% improvement in prosody and naturalness vs Speech-2.6 (blind testing, 60 randomized audio pairs with native speakers)
• 54.5% win rate vs Speech-2.0 baseline in general listening tests
• Sub-300ms TTFT in production deployments
• 59% of naturalness failures in blind testing attributed to Speech-2.6, resolved in Speech-2.8
Prompting
Together AI API Access:
• Access MiniMax Speech 2.8 via Together AI dedicated endpoints using the endpoint minimax/speech-2.8-turbo
• Authenticate using your Together AI API key in request headers
• Insert Sound Tags directly into input text for real-time vocal emote control
• Supports streaming synthesis for low-latency production deployments
• Available on Together AI dedicated infrastructure co-located with LLM and STT workloads
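The points above can be sketched as a request builder. Only the identifier `minimax/speech-2.8-turbo` and the use of an API key in the request headers come from this page. The URL, the `Bearer` scheme, and the `model` and `input` field names are assumptions modeled on common speech-API conventions, so confirm them against the Together AI API reference for your dedicated endpoint. The function only constructs the request and does not send it.

```python
import json
import urllib.request

# Assumption: the speech route and JSON field names below are illustrative;
# check the Together AI API reference for the values your endpoint expects.
ASSUMED_SPEECH_URL = "https://api.together.xyz/v1/audio/speech"
MODEL_ID = "minimax/speech-2.8-turbo"  # identifier given on this page


def build_speech_request(text: str, api_key: str,
                         url: str = ASSUMED_SPEECH_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for text-to-speech synthesis."""
    body = {
        "model": MODEL_ID,
        "input": text,  # Sound Tags such as (laughs) go directly in the text
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the resulting object to `urllib.request.urlopen` would send it. Reading the response in chunks rather than all at once is what lets playback begin before synthesis finishes.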
Applications & use cases
Real-Time Voice Agents:
• AI customer service and support with sub-250ms end-to-end response times
• Interactive voice response systems requiring natural, expressive speech
• Live conversational AI with Sound Tags for breathing, pacing, and vocal emotes
AI Companionship & NPCs:
• Emotionally resonant character voices with Sound Tags for laughter, sighs, and gasps
• High-fidelity voice cloning for consistent character identity across sessions
• Human-like fluency validated by native speaker blind testing
Multilingual Voice Applications:
• Global contact centers with 40+ language support
• Voice cloning for localized brand voices across markets
• Content production, audiobooks, and storytelling with expressive narration control
Enterprise Voice Infrastructure:
• Voice agent platforms requiring reliable, scalable TTS on dedicated Together AI infrastructure
• Integration with LLM and STT for end-to-end co-located voice pipelines
• Production deployments with consistent performance and isolated workloads
- Model provider: MiniMax AI
- Type: Audio
- Main use cases: Text-to-Speech
- Deployment: On-Demand Dedicated
- Price: $30 / 1M characters + GPU hourly (by hardware)
- Output modalities: Audio
- Category: Audio
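The listed price has two parts, a per-character charge and an hourly GPU charge that depends on hardware. A rough estimate under that reading might look like the sketch below. The GPU hourly rate is not given on this page, so it is passed in as a parameter, and `estimate_cost_usd` is a hypothetical helper name.

```python
PRICE_PER_MILLION_CHARS_USD = 30.0  # character rate listed on this page


def estimate_cost_usd(characters: int, gpu_hours: float,
                      gpu_hourly_rate_usd: float) -> float:
    """Estimate spend as character charge plus GPU time.

    gpu_hourly_rate_usd varies by hardware and is not stated on this page;
    supply the rate quoted for your dedicated endpoint.
    """
    character_cost = characters / 1_000_000 * PRICE_PER_MILLION_CHARS_USD
    gpu_cost = gpu_hours * gpu_hourly_rate_usd
    return character_cost + gpu_cost
```

For instance, 2,000,000 characters alone come to $60 before any GPU time is added.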