Cartesia Sonic 3.5

#1 for naturalness with sub-90ms latency and 42-language support for production voice agents

About model

Cartesia Sonic 3.5 is Cartesia's latest text-to-speech model, ranked #1 for naturalness on the Artificial Analysis Speech Arena leaderboard. Built on State Space Models (SSMs), it delivers sub-90ms time-to-first-audio with expressive, conversational delivery across 42 languages at native quality. The model handles alphanumerics such as order numbers, phone numbers, and IDs natively without preprocessing, and it resolves English heteronyms from context. It is available on Together AI serverless and dedicated infrastructure, co-located with LLM and STT workloads.

Speech Arena (Artificial Analysis)

#1

Ranked #1 for naturalness among TTS models

TTS Latency

Sub-90ms

State Space Model architecture for real-time voice agents

Languages

42

Native quality including English, Hindi, Spanish, French, German, Japanese, Arabic, and more

Model key capabilities
  • Sub-90ms Latency: State Space Model architecture delivering sub-90ms time-to-first-audio for real-time conversational voice agents and interactive applications
  • 42 Languages at Native Quality: English, Hindi, Spanish, French, German, Japanese, Arabic, Hebrew, and 34 more — each at native quality without degraded expressiveness
  • Native Alphanumeric Handling: Order numbers, phone numbers, IDs, emails, and confirmation codes spoken naturally in every language without preprocessing or special formatting
  • Expressive Conversational Delivery: Strong pacing, real emotional range, and context-aware English pronunciation for heteronyms — tuned for support and agent transcripts
  • Model card

    Architecture Overview:
    • State Space Model (SSM) architecture designed for live, synchronous interactions
    • SSMs deliver ultra-low latency and greater efficiency at scale compared to transformer-based approaches
    • Sub-90ms time-to-first-audio in production deployments
    • Support for 42 languages: English, Hindi, Spanish, French, German, Japanese, Chinese, Korean, Portuguese, Italian, Dutch, Polish, Russian, Arabic, Hebrew, and 27 more
    • Model IDs: sonic-3.5 (auto-updated to latest stable snapshot), sonic-3.5-2026-05-04 (pinned), sonic-latest (beta testing)
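
The three model IDs above suggest a simple pinning policy. The following is a minimal sketch, assuming hypothetical environment names (`production`, `staging`, `experimental`) that do not appear in the model card; only the ID strings come from it. Whether these IDs need a `cartesia/` prefix on Together AI is not stated here, so treat the values as placeholders to verify.

```python
# Model IDs as listed in the model card; the environment names are hypothetical.
MODEL_IDS = {
    "production": "sonic-3.5-2026-05-04",  # pinned snapshot: does not move
    "staging": "sonic-3.5",                # alias: follows the latest stable snapshot
    "experimental": "sonic-latest",        # beta alias: for testing only
}


def select_model_id(environment: str) -> str:
    """Return the model ID for a deployment environment (illustrative policy)."""
    try:
        return MODEL_IDS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
```

Pinning production to the dated snapshot keeps voice output stable across releases, while the alias in staging surfaces changes before the pin is moved.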

    Training Methodology:
    • Trained for expressive, conversational delivery with strong pacing and real emotional range
    • Optimized for support and agent transcripts requiring natural turn-taking
    • Context-aware English pronunciation for heteronyms (read, bass, bow) resolved from surrounding words
    • Native alphanumeric training: order numbers, phone numbers, IDs, and emails across all 42 languages

    Performance Characteristics:
    • #1 for naturalness on Artificial Analysis Speech Arena leaderboard
    • Sub-90ms time-to-first-audio; 100ms p90 time to first byte (TTFB) in production deployments
    • Clean audio across all 42 languages with no artifacts
    • Proven at scale across enterprise deployments
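
The latency figures above distinguish a headline time-to-first-audio from a p90 value. The following is a minimal sketch of how such numbers could be measured on the client side, assuming the audio arrives as an iterable of byte chunks. The helper names are hypothetical, and the nearest-rank percentile is one common definition, not necessarily the one behind the published figures.

```python
import math
import time
from typing import Iterable, List


def time_to_first_chunk(chunks: Iterable[bytes]) -> float:
    """Seconds from starting iteration until the first non-empty chunk arrives."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended before any audio arrived")


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

For the measurement to include request time, the iterable should be lazy (for example, a generator that issues the request on its first `next()` call). Client-side numbers also include network round-trip time, so they will generally be higher than server-side figures.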

  • Prompting

    Together AI API Access:
    • Access Cartesia Sonic 3.5 through Together AI dedicated endpoints with the model identifier cartesia/sonic-3.5
    • Authenticate using your Together AI API key in request headers
    • Use sonic-3.5 for auto-updated stable releases; use sonic-3.5-2026-05-04 to pin to the current snapshot
    • Supports streaming synthesis for real-time applications
    • Available on Together AI serverless and dedicated infrastructure co-located with LLM and STT workloads
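
The access notes above can be sketched as a request builder. This is a hedged illustration, not the documented API: the URL path, the `Bearer` scheme, and the `input` and `voice` field names are assumptions modeled on common speech endpoints, and valid voice names are not listed on this page. Only the model name `cartesia/sonic-3.5` and the use of an API key in the request headers come from the text. The function builds the request without sending it.

```python
import json
import urllib.request

# Assumed endpoint path; check the Together AI API reference before relying on it.
SPEECH_URL = "https://api.together.xyz/v1/audio/speech"


def build_speech_request(text: str, api_key: str, voice: str,
                         model: str = "cartesia/sonic-3.5") -> urllib.request.Request:
    """Build (but do not send) a text-to-speech request."""
    payload = {
        "model": model,  # model name taken from this page
        "input": text,   # assumed field name
        "voice": voice,  # assumed field name; voice names are not given here
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the request to `urllib.request.urlopen` and read the response body in chunks, which is also where streaming playback would begin.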

  • Applications & use cases

    Real-Time Voice Agents:
    • Sub-90ms TTS for natural conversational flow in live interactions
    • Customer service and support agents with expressive, emotionally aware delivery
    • Interactive voice response systems requiring accurate alphanumeric pronunciation

    AI Telephony & Outbound:
    • Outbound calling with natural pacing and sub-100ms first-audio for conversational feel
    • Accurate pronunciation of order IDs, account numbers, and confirmation codes without preprocessing
    • Multilingual outbound campaigns across 42 languages at native quality

    Multilingual Voice Applications:
    • Global contact centers with a single model covering 42 languages
    • Localized voice experiences with native-quality delivery across markets
    • Code-switching and multilingual content without separate per-language deployments

    Enterprise Voice Infrastructure:
    • Co-located with LLM and STT on Together AI for unified, low-latency voice pipelines
    • Serverless for rapid evaluation and prototyping; dedicated for production isolation and consistent throughput
    • SOC 2 compliant infrastructure for regulated enterprise deployments
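
In a voice pipeline like the one described above, the LLM usually streams text in small fragments, and the TTS stage works best on complete sentences. The following is a minimal sketch of that hand-off, assuming a hypothetical helper that buffers fragments and emits a sentence whenever the buffer ends in terminal punctuation. It is a common pattern, not something this page prescribes.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM text fragments into sentences for the TTS stage."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush when the buffered text ends in sentence punctuation.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # trailing text without final punctuation
```

The split is deliberately naive: abbreviations and decimals that arrive as separate fragments will flush early, so a production pipeline would need a more careful boundary check.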

Model details
  • Model provider
    Cartesia
  • Type
    Audio
  • Main use cases
    Text-to-Speech
  • Deployment
    Monthly Reserved
    On-Demand Dedicated
  • Price
    $65.00 per 1M characters, plus an hourly GPU charge that depends on hardware
  • Input modalities
    Text
  • Output modalities
    Audio
  • Released
    May 3, 2026
  • Category
    Audio
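
The listed price combines a per-character charge with GPU time. The following is a minimal sketch of the arithmetic, assuming the two components simply add. The GPU hourly rate is not given on this page, so it is left as a caller-supplied placeholder, and the function name is hypothetical.

```python
from decimal import Decimal

PRICE_PER_MILLION_CHARS = Decimal("65.00")  # USD, from the listed price


def estimate_cost(characters: int,
                  gpu_hours: Decimal = Decimal("0"),
                  gpu_hourly_rate: Decimal = Decimal("0")) -> Decimal:
    """Character charge plus GPU time; the GPU rate is a placeholder input."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    character_cost = PRICE_PER_MILLION_CHARS * characters / Decimal(1_000_000)
    return character_cost + gpu_hours * gpu_hourly_rate
```

For example, 250,000 synthesized characters come to $16.25 in character charges before any GPU time is added.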