Cartesia Sonic-3

Low-latency, ultra-realistic voice model, served in partnership with Cartesia.

About model

Cartesia Sonic-3 converts text to speech with high expressiveness and naturalness. Its key strengths are voice quality and low-latency synthesis, which make it suitable for developers who need fast, high-fidelity speech generation.

  • API usage

    • cURL
    • Python
    • TypeScript

    Endpoint:

    cartesia/sonic-3

    cURL:

    # The Authorization header uses double quotes so the shell expands $TOGETHER_API_KEY.
    curl --location 'https://api.together.ai/v1/audio/generations' \
      --header 'Content-Type: application/json' \
      --header "Authorization: Bearer $TOGETHER_API_KEY" \
      --output speech.mp3 \
      --data '{
        "input": "Today is a wonderful day to build something people love!",
        "voice": "helpful woman",
        "response_format": "mp3",
        "sample_rate": 44100,
        "stream": false,
        "model": "cartesia/sonic-3"
      }'
    
    Python:

    from together import Together

    # The client reads the API key from the TOGETHER_API_KEY environment variable.
    client = Together()

    speech_file_path = "speech.mp3"

    response = client.audio.speech.create(
      model="cartesia/sonic-3",
      input="Today is a wonderful day to build something people love!",
      voice="helpful woman",
    )

    # Write the generated audio to disk.
    response.stream_to_file(speech_file_path)
    
    
    TypeScript:

    import { createWriteStream } from 'fs';
    import { Readable } from 'stream';
    import Together from 'together-ai';

    const together = new Together();

    async function generateAudio() {
      const res = await together.audio.create({
        input: 'Today is a wonderful day to build something people love!',
        voice: 'helpful woman',
        response_format: 'mp3',
        sample_rate: 44100,
        stream: false,
        model: 'cartesia/sonic-3',
      });

      if (res.body) {
        console.log(res.body);
        // Convert the response body into a Node.js stream and write it to disk.
        const nodeStream = Readable.from(res.body as ReadableStream);
        const fileStream = createWriteStream('./speech.mp3');

        nodeStream.pipe(fileStream);
      }
    }

    generateAudio();
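
For environments where the SDK is not installed, the request from the cURL example can also be assembled with the Python standard library alone. The following is a minimal sketch, not an official client: the URL and body fields are copied from the cURL example above, while the helper names `build_speech_request` and `synthesize_to_file` are hypothetical, and the assumption that a successful response body is the raw audio is inferred from the cURL example's use of `--output speech.mp3`.

```python
import json
import os
import urllib.request

# URL and body fields are taken from the cURL example; nothing else is assumed about the API.
API_URL = "https://api.together.ai/v1/audio/generations"


def build_speech_request(text, api_key, voice="helpful woman"):
    """Build (but do not send) the POST request shown in the cURL example."""
    payload = {
        "input": text,
        "voice": voice,
        "response_format": "mp3",
        "sample_rate": 44100,
        "stream": False,
        "model": "cartesia/sonic-3",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def synthesize_to_file(text, path="speech.mp3"):
    """Send the request and save the response body to `path`.

    Assumption: a successful response body is the raw audio, as the
    cURL example's `--output speech.mp3` suggests.
    """
    request = build_speech_request(text, os.environ["TOGETHER_API_KEY"])
    with urllib.request.urlopen(request) as response, open(path, "wb") as audio_file:
        audio_file.write(response.read())
```

Calling `synthesize_to_file("Today is a wonderful day to build something people love!")` with `TOGETHER_API_KEY` set would send the request. Keeping request construction separate from sending makes the payload easy to inspect without network access.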
    
    
  • Model provider
    Cartesia
  • Type
    Audio
  • Main use cases
    Text-to-Speech
  • Deployment
    Serverless
  • Context length
    Unlimited
  • Price
    $65.00 / 1M characters
  • Input modalities
    Text
  • Output modalities
    Audio
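
Since pricing is per character, the cost of a request can be roughly estimated from the length of the input text. The sketch below applies the listed rate of $65.00 per 1M characters. It assumes billed characters correspond to Python's `len()` of the input string, which is an assumption; actual billing may count characters differently. The function name `estimate_cost_usd` is hypothetical.

```python
# Listed price: $65.00 per 1,000,000 input characters.
PRICE_PER_MILLION_CHARS_USD = 65.00


def estimate_cost_usd(text: str) -> float:
    """Estimate synthesis cost in USD, assuming one billed character per len() unit."""
    return len(text) * PRICE_PER_MILLION_CHARS_USD / 1_000_000


# Example: a 10,000-character script.
script = "a" * 10_000
print(f"Estimated cost: ${estimate_cost_usd(script):.2f}")
```

At this rate, a 10,000-character script comes to about $0.65.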