Cartesia Sonic 3.5
#1 for naturalness with sub-90ms latency and 42-language support for production voice agents
About model
Cartesia Sonic 3.5 is Cartesia's latest text-to-speech model, ranked #1 for naturalness on the Artificial Analysis Speech Arena leaderboard. Built on State Space Models (SSMs), it delivers sub-90ms time-to-first-audio with expressive, conversational delivery across 42 languages at native quality. The model handles alphanumerics, order numbers, phone numbers, and IDs natively without preprocessing, and resolves English heteronyms from context. It is available on Together AI serverless and dedicated infrastructure, co-located with LLM and STT workloads.
#1: Ranked #1 for naturalness among TTS models
Sub-90ms: State Space Model architecture for real-time voice agents
42 languages: Native quality including English, Hindi, Spanish, French, German, Japanese, Arabic, and more
- Sub-90ms Latency: State Space Model architecture delivering sub-90ms time-to-first-audio for real-time conversational voice agents and interactive applications
- 42 Languages at Native Quality: English, Hindi, Spanish, French, German, Japanese, Arabic, Hebrew, and 34 more — each at native quality without degraded expressiveness
- Native Alphanumeric Handling: Order numbers, phone numbers, IDs, emails, and confirmation codes spoken naturally in every language without preprocessing or special formatting
- Expressive Conversational Delivery: Strong pacing, real emotional range, and context-aware English pronunciation for heteronyms — tuned for support and agent transcripts
Model card
Architecture Overview:
• State Space Model (SSM) architecture designed for live, synchronous interactions
• SSMs deliver lower latency and greater efficiency at scale than transformer-based approaches
• Sub-90ms time-to-first-audio in production deployments
• Support for 42 languages: English, Hindi, Spanish, French, German, Japanese, Chinese, Korean, Portuguese, Italian, Dutch, Polish, Russian, Arabic, Hebrew, and 27 more
• Model IDs: sonic-3.5 (auto-updated to latest stable snapshot), sonic-3.5-2026-05-04 (pinned), sonic-latest (beta testing)
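The three identifiers above trade stability against freshness. The sketch below shows one way a client might choose between them and recognise a pinned snapshot. The helper names (`choose_model_id`, `snapshot_date`) and the dictionary keys are hypothetical, not part of any Cartesia or Together AI SDK, and the `YYYY-MM-DD` suffix format is inferred from the single pinned ID listed here.

```python
import re
from datetime import date
from typing import Optional

# Identifiers copied from the model card; the dictionary keys are illustrative.
MODEL_IDS = {
    "stable": "sonic-3.5",             # auto-updated to the latest stable snapshot
    "pinned": "sonic-3.5-2026-05-04",  # fixed snapshot
    "beta": "sonic-latest",            # beta testing
}

# Assumed convention: a pinned ID ends in -YYYY-MM-DD.
_SNAPSHOT_RE = re.compile(r"^.+-(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})$")


def choose_model_id(reproducible: bool, allow_beta: bool = False) -> str:
    """Pick a model ID: pinned when output must not drift, otherwise an alias."""
    if reproducible:
        return MODEL_IDS["pinned"]
    return MODEL_IDS["beta"] if allow_beta else MODEL_IDS["stable"]


def snapshot_date(model_id: str) -> Optional[date]:
    """Return the snapshot date encoded in a pinned ID, or None for an alias."""
    match = _SNAPSHOT_RE.match(model_id)
    if match is None:
        return None
    return date(int(match["y"]), int(match["m"]), int(match["d"]))
```

Pinning suits regression tests and regulated deployments where voice output should not change silently; the auto-updated alias suits applications that prefer to pick up improvements without a code change.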
Training Methodology:
• Trained for expressive, conversational delivery with strong pacing and real emotional range
• Optimized for support and agent transcripts requiring natural turn-taking
• Context-aware English pronunciation for heteronyms (read, bass, bow) resolved from surrounding words
• Native alphanumeric training: order numbers, phone numbers, IDs, and emails across all 42 languages
Performance Characteristics:
• #1 for naturalness on Artificial Analysis Speech Arena leaderboard
• Sub-90ms time-to-first-audio; 100ms p90 time-to-first-byte (TTFB) in production deployments
• Clean audio across all 42 languages with no artifacts
• Proven at scale across enterprise deployments
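A p90 latency figure is a percentile over many requests, not an average: 90% of requests complete their first byte at or below that value. As an illustration of how such a number could be computed from your own measurements, here is a nearest-rank percentile over time-to-first-byte samples. The sample values are invented for the example and are not benchmark results for this model.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples_ms: Sequence[float], pct: float) -> float:
    """Return the pct-th percentile of samples_ms using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    # 1-based rank of the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up TTFB samples in milliseconds, for illustration only.
ttfb_ms = [72, 80, 85, 78, 91, 88, 76, 99, 83, 120]
p50 = percentile_nearest_rank(ttfb_ms, 50)
p90 = percentile_nearest_rank(ttfb_ms, 90)
print(f"p50={p50}ms p90={p90}ms")
```

Reporting p90 alongside the median matters for voice agents because a single slow response is audible to the caller even when typical latency is low.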
Prompting
Together AI API Access:
• Access Cartesia Sonic 3.5 via Together AI dedicated endpoints using the model ID cartesia/sonic-3.5
• Authenticate using your Together AI API key in request headers
• Use sonic-3.5 for auto-updated stable releases; use sonic-3.5-2026-05-04 to pin to the current snapshot
• Supports streaming synthesis for real-time applications
• Available on Together AI serverless and dedicated infrastructure co-located with LLM and STT workloads
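The access pattern above (a model identifier plus an API key in the request headers) can be sketched as follows. This is a minimal illustration under stated assumptions, not the documented API: the endpoint URL, the `Bearer` authorization scheme, and the JSON field names `input` and `voice` are assumptions to check against the Together AI API reference, and the voice value is a placeholder. The function only builds the request; sending it is left commented out so the sketch has no network side effects.

```python
import json
import os
import urllib.request

# Assumed endpoint URL; confirm against the Together AI API reference.
SPEECH_URL = "https://api.together.xyz/v1/audio/speech"


def build_speech_request(text: str, voice: str, api_key: str,
                         model: str = "cartesia/sonic-3.5") -> urllib.request.Request:
    """Build (but do not send) a text-to-speech HTTP request."""
    if not text:
        raise ValueError("text must be non-empty")
    payload = {
        "model": model,  # identifier given on this page
        "input": text,   # assumed field name for the text to synthesize
        "voice": voice,  # assumed field name; valid values depend on available voices
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # API key in request headers (scheme assumed)
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request(
        "Your order number is A1B2-3344.",
        voice="YOUR_VOICE_ID",  # placeholder, not a real voice name
        api_key=os.environ.get("TOGETHER_API_KEY", "YOUR_API_KEY"),
    )
    # To actually synthesize audio, uncomment:
    # with urllib.request.urlopen(req) as resp:
    #     audio_bytes = resp.read()
    print(req.get_method(), req.full_url)
```

Reading the key from an environment variable keeps it out of source control; the variable name `TOGETHER_API_KEY` is likewise an assumption.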
Applications & use cases
Real-Time Voice Agents:
• Sub-90ms TTS for natural conversational flow in live interactions
• Customer service and support agents with expressive, emotionally aware delivery
• Interactive voice response systems requiring accurate alphanumeric pronunciation
AI Telephony & Outbound:
• Outbound calling with natural pacing and sub-100ms first-audio for conversational feel
• Accurate pronunciation of order IDs, account numbers, and confirmation codes without preprocessing
• Multilingual outbound campaigns across 42 languages at native quality
Multilingual Voice Applications:
• Global contact centers with a single model covering 42 languages
• Localized voice experiences with native-quality delivery across markets
• Code-switching and multilingual content without separate per-language deployments
Enterprise Voice Infrastructure:
• Co-located with LLM and STT on Together AI for unified, low-latency voice pipelines
• Serverless for rapid evaluation and prototyping; dedicated for production isolation and consistent throughput
• SOC 2 compliant infrastructure for regulated enterprise deployments
- Model provider: Cartesia
- Type: Audio
- Main use cases: Text-to-Speech
- Deployment: Monthly Reserved, On-Demand Dedicated
- Price: $65.00 / 1M characters + GPU hourly (by hardware)
- Input modalities: Text
- Output modalities: Audio
- Released: May 3, 2026
- Category: Audio
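The listed price has two parts: a per-character rate and a GPU hourly charge that depends on hardware. A rough estimator is sketched below. The per-character rate comes from the listing above; the GPU hourly rate is not stated on this page, so it is a caller-supplied parameter, and the function name `estimate_cost_usd` is hypothetical.

```python
from decimal import Decimal

# USD per one million input characters, from the price listing.
PRICE_PER_MILLION_CHARS = Decimal("65.00")


def estimate_cost_usd(characters: int,
                      gpu_hours: Decimal = Decimal("0"),
                      gpu_hourly_rate: Decimal = Decimal("0")) -> Decimal:
    """Estimate cost as character charge plus GPU time, rounded to cents.

    gpu_hourly_rate varies by hardware and is not given on this page,
    so the caller must supply it.
    """
    if characters < 0:
        raise ValueError("characters must be non-negative")
    char_cost = PRICE_PER_MILLION_CHARS * characters / Decimal(1_000_000)
    gpu_cost = Decimal(gpu_hours) * Decimal(gpu_hourly_rate)
    return (char_cost + gpu_cost).quantize(Decimal("0.01"))


# Character charge alone for 250,000 characters of synthesized text.
print(estimate_cost_usd(250_000))
```

`Decimal` is used instead of floats so that currency arithmetic stays exact when many small charges are summed.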