Cartesia Sonic 3.5

#1 for naturalness with sub-90ms latency and 42-language support for production voice agents

About model

Cartesia Sonic 3.5 is Cartesia's latest text-to-speech model, ranked #1 for naturalness on the Artificial Analysis Speech Arena leaderboard. Built on State Space Models (SSMs), it delivers sub-90ms time-to-first-audio with expressive, conversational delivery across 42 languages at native quality. The model reads alphanumeric strings such as order numbers, phone numbers, and IDs natively without preprocessing, and applies context-aware English pronunciation to heteronyms. It is available on Together AI serverless and dedicated infrastructure, co-located with LLM and speech-to-text (STT) workloads.

Speech Arena (Artificial Analysis): Ranked #1 for naturalness among TTS models
TTS Latency: Sub-90ms, with a State Space Model architecture for real-time voice agents
Languages: 42 at native quality, including English, Hindi, Spanish, French, German, Japanese, Arabic, and more

Model key capabilities

Sub-90ms Latency: State Space Model architecture delivering sub-90ms time-to-first-audio for real-time conversational voice agents and interactive applications
42 Languages at Native Quality: English, Hindi, Spanish, French, German, Japanese, Arabic, Hebrew, and 34 more — each at native quality without degraded expressiveness
Native Alphanumeric Handling: Order numbers, phone numbers, IDs, emails, and confirmation codes spoken naturally in every language without preprocessing or special formatting
Expressive Conversational Delivery: Strong pacing, real emotional range, and context-aware English pronunciation for heteronyms — tuned for support and agent transcripts

Model card
Architecture Overview:
• State Space Model (SSM) architecture designed for live, synchronous interactions
• SSMs deliver ultra-low latency and greater efficiency at scale compared to transformer-based approaches
• Sub-90ms time-to-first-audio in production deployments
• Support for 42 languages, including English, Hindi, Spanish, French, German, Japanese, Chinese, Korean, Portuguese, Italian, Dutch, Polish, Russian, Arabic, Hebrew, and 27 more
• Model IDs: sonic-3.5 (auto-updated to latest stable snapshot), sonic-3.5-2026-05-04 (pinned), sonic-latest (beta testing)
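The difference between the auto-updated and pinned model IDs listed above can be checked in code before a deployment goes out. The following is a minimal sketch, assuming that pinned IDs always end in a -YYYY-MM-DD date suffix, as the single example sonic-3.5-2026-05-04 suggests; the helper names parse_model_id and is_pinned are hypothetical and not part of any Cartesia or Together AI SDK.

```python
import re
from datetime import date
from typing import NamedTuple, Optional


class ModelId(NamedTuple):
    family: str                # e.g. "sonic-3.5" or "sonic-latest"
    snapshot: Optional[date]   # set only for pinned, date-suffixed IDs


# Assumption: a pinned ID is the family name plus a trailing -YYYY-MM-DD.
_SNAPSHOT_SUFFIX = re.compile(
    r"^(?P<family>.+)-(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})$"
)


def parse_model_id(model_id: str) -> ModelId:
    """Split a model ID into its family and optional snapshot date."""
    match = _SNAPSHOT_SUFFIX.match(model_id)
    if match is None:
        # No date suffix: treat the whole string as an unpinned family name.
        return ModelId(family=model_id, snapshot=None)
    snapshot = date(int(match["y"]), int(match["m"]), int(match["d"]))
    return ModelId(family=match["family"], snapshot=snapshot)


def is_pinned(model_id: str) -> bool:
    """True when the ID names a dated snapshot rather than a moving alias."""
    return parse_model_id(model_id).snapshot is not None
```

A configuration check could, for example, reject any production config where is_pinned(model_id) is false, so that an auto-updated alias never changes voice behavior without a deliberate rollout.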

Training Methodology:
• Trained for expressive, conversational delivery with strong pacing and real emotional range
• Optimized for support and agent transcripts requiring natural turn-taking
• Context-aware English pronunciation for heteronyms (read, bass, bow) resolved from surrounding words
• Native alphanumeric training: order numbers, phone numbers, IDs, and emails across all 42 languages

Performance Characteristics:
• #1 for naturalness on Artificial Analysis Speech Arena leaderboard
• Sub-90ms time-to-first-audio; 100ms p90 time to first byte (TTFB) in production deployments
• Clean audio across all 42 languages with no artifacts
• Proven at scale across enterprise deployments
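A p90 figure such as the 100ms TTFB quoted above means that 90% of measured requests received their first byte at or below that value. The sketch below shows one common way to compute it, the nearest-rank method. This page does not say which percentile method or measurement setup was used, so treat that choice as an assumption; the sample values are made up for illustration.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples_ms: Sequence[float], pct: float) -> float:
    """Return the nearest-rank percentile of the samples (pct in (0, 100])."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    # 1-based rank of the smallest value covering pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical time-to-first-byte measurements in milliseconds (not real data).
ttfb_ms = [72, 75, 78, 80, 81, 83, 85, 88, 95, 120]
p90 = percentile_nearest_rank(ttfb_ms, 90)  # 9th of 10 sorted values: 95
```

In this made-up sample the single slow request (120ms) does not move the p90, which is why tail percentiles are usually reported alongside a typical figure such as time-to-first-audio.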
Prompting
Together AI API Access:
• Access Cartesia Sonic 3.5 through a Together AI dedicated endpoint, addressed as cartesia/sonic-3.5
• Authenticate by sending your Together AI API key in the request headers
• Use sonic-3.5 for auto-updated stable releases; use sonic-3.5-2026-05-04 to pin to the current snapshot
• Supports streaming synthesis for real-time applications
• Available on Together AI serverless and dedicated infrastructure co-located with LLM and STT workloads
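The access steps above can be sketched as a small request builder. Several details here are assumptions not confirmed by this page: the URL https://api.together.xyz/v1/audio/speech, the Bearer authorization scheme, and the JSON fields model, input, and voice. The function names and the voice value are hypothetical, and a dedicated endpoint may expose a different model name, so check the Together AI documentation before relying on any of it.

```python
import json
import urllib.request

# Assumed route; verify against the Together AI API reference.
SPEECH_URL = "https://api.together.xyz/v1/audio/speech"
DEFAULT_MODEL = "cartesia/sonic-3.5"  # identifier quoted on this page


def build_speech_request(api_key: str, text: str, voice: str,
                         model: str = DEFAULT_MODEL) -> urllib.request.Request:
    """Build (but do not send) a text-to-speech request."""
    # Field names are assumptions about the request schema.
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # API key in request headers
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(request: urllib.request.Request, out_path: str,
               chunk_size: int = 4096) -> int:
    """Send the request and write audio bytes to disk as they arrive.

    Returns the number of bytes written. Reading in chunks only consumes the
    HTTP response incrementally; whether the service streams audio on this
    route is not confirmed here.
    """
    written = 0
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        while chunk := response.read(chunk_size):
            out.write(chunk)
            written += len(chunk)
    return written
```

Keeping request construction separate from sending makes it easy to inspect the headers and payload in tests without a network call. To pin a snapshot, pass the pinned ID as the model argument instead of the default.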
Applications & use cases
Real-Time Voice Agents:
• Sub-90ms TTS for natural conversational flow in live interactions
• Customer service and support agents with expressive, emotionally aware delivery
• Interactive voice response systems requiring accurate alphanumeric pronunciation

AI Telephony & Outbound:
• Outbound calling with natural pacing and sub-100ms first-audio for conversational feel
• Accurate pronunciation of order IDs, account numbers, and confirmation codes without preprocessing
• Multilingual outbound campaigns across 42 languages at native quality

Multilingual Voice Applications:
• Global contact centers with a single model covering 42 languages
• Localized voice experiences with native-quality delivery across markets
• Code-switching and multilingual content without separate per-language deployments

Enterprise Voice Infrastructure:
• Co-located with LLM and STT on Together AI for unified, low-latency voice pipelines
• Serverless for rapid evaluation and prototyping; dedicated for production isolation and consistent throughput
• SOC 2-compliant infrastructure for regulated enterprise deployments

Model specifications

Model data

Model provider: Cartesia
Type: Audio
Main use cases: Text-to-Speech
Deployment: Dedicated
Price: $65.00 / 1M characters + GPU hourly (by hardware)
Input modalities: Text
Output modalities: Audio

Released: May 3, 2026
Category: Audio
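The listed price of $65.00 per 1M characters plus a GPU hourly charge can be turned into a rough budget estimate. This is a sketch only: it assumes the two components simply add, and the GPU hourly rate is a hypothetical input because this page does not state one (it varies by hardware). The function name estimate_cost_usd is illustrative.

```python
from decimal import Decimal

PRICE_PER_MILLION_CHARS = Decimal("65.00")  # USD, from the listing on this page


def estimate_cost_usd(characters: int, gpu_hours: Decimal,
                      gpu_hourly_rate: Decimal) -> Decimal:
    """Estimate cost as per-character charge plus GPU time, rounded to cents."""
    if characters < 0 or gpu_hours < 0 or gpu_hourly_rate < 0:
        raise ValueError("inputs must be non-negative")
    character_cost = PRICE_PER_MILLION_CHARS * characters / 1_000_000
    gpu_cost = gpu_hours * gpu_hourly_rate  # rate is an assumed, per-hardware value
    return (character_cost + gpu_cost).quantize(Decimal("0.01"))


# 250,000 characters plus 10 GPU hours at a hypothetical $3.00/hour:
# 16.25 + 30.00 = 46.25
example = estimate_cost_usd(250_000, Decimal("10"), Decimal("3.00"))
```

Decimal is used instead of float so that cent-level totals do not pick up binary rounding error.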
