
Orpheus TTS

Human-level speech generation with natural emotion and intonation

About model

Orpheus TTS is a speech-LLM family built on a Llama-3B backbone that achieves human-level speech generation with natural emotion and intonation. Trained on 100k+ hours of English speech data, Orpheus demonstrates that open-source TTS can compete with, and in some cases surpass, closed-source models in real-world quality.

  • API usage

    • cURL
    • Python
    • TypeScript

    Endpoint:

    canopylabs/orpheus-3b-0.1-ft

    curl --location 'https://api.together.ai/v1/audio/generations' \
      --header 'Content-Type: application/json' \
      --header "Authorization: Bearer $TOGETHER_API_KEY" \
      --output speech.mp3 \
      --data '{
        "input": "Today is a wonderful day to build something people love!",
        "voice": "helpful woman",
        "response_format": "mp3",
        "sample_rate": 44100,
        "stream": false,
        "model": "canopylabs/orpheus-3b-0.1-ft"
      }'
    
    from together import Together
    
    # The client reads TOGETHER_API_KEY from the environment
    client = Together()
    
    speech_file_path = "speech.mp3"
    
    response = client.audio.speech.create(
      model="canopylabs/orpheus-3b-0.1-ft",
      input="Today is a wonderful day to build something people love!",
      voice="helpful woman",
    )
    
    # Write the returned audio to disk
    response.stream_to_file(speech_file_path)
    
    
    import Together from 'together-ai';
    import { Readable } from 'node:stream';
    import { createWriteStream } from 'node:fs';
    
    const together = new Together();
    
    async function generateAudio() {
      const res = await together.audio.create({
        input: 'Today is a wonderful day to build something people love!',
        voice: 'helpful woman',
        response_format: 'mp3',
        sample_rate: 44100,
        stream: false,
        model: 'canopylabs/orpheus-3b-0.1-ft',
      });
    
      if (res.body) {
        // Pipe the response body to a local file
        const nodeStream = Readable.from(res.body as ReadableStream);
        const fileStream = createWriteStream('./speech.mp3');
        nodeStream.pipe(fileStream);
      }
    }
    
    generateAudio();
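    If you are not using an SDK, the same request can be made over plain HTTP. Below is a minimal sketch using Python's standard library, mirroring the fields in the cURL example above; the `build_payload` and `generate_speech` helpers are illustrative names, not part of any SDK.

```python
import json
import os
import urllib.request

API_URL = "https://api.together.ai/v1/audio/generations"

def build_payload(text: str) -> dict:
    """Request body matching the fields shown in the cURL example."""
    return {
        "input": text,
        "voice": "helpful woman",
        "response_format": "mp3",
        "sample_rate": 44100,
        "stream": False,
        "model": "canopylabs/orpheus-3b-0.1-ft",
    }

def generate_speech(text: str, out_path: str = "speech.mp3") -> None:
    """POST the payload and write the returned audio bytes to disk."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

    Calling `generate_speech("Hello!")` writes `speech.mp3` to the working directory; it requires `TOGETHER_API_KEY` to be set in the environment.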
    
    
  • Model card

    Architecture Overview:
    • Llama-3B backbone architecture adapted for speech-LLM applications
    • Trained on 100k+ hours of English speech data and billions of text tokens
    • SNAC audio tokenizer with 7 tokens per frame decoded as flattened sequence
    • CNN-based detokenizer with sliding window modification for streaming without popping
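
    The "7 tokens per frame decoded as flattened sequence" detail can be illustrated with a toy sketch. The frame layout and token values below are hypothetical stand-ins; the real SNAC codec interleaves tokens from multiple codebook levels, which this sketch does not model.

```python
# Toy illustration: each audio frame carries 7 codec tokens, and the LLM
# emits them as one flat token stream (frame boundaries are implicit).
TOKENS_PER_FRAME = 7

def flatten_frames(frames: list[list[int]]) -> list[int]:
    """Serialize per-frame token groups into the flat sequence the LLM predicts."""
    assert all(len(f) == TOKENS_PER_FRAME for f in frames)
    return [tok for frame in frames for tok in frame]

def unflatten_frames(seq: list[int]) -> list[list[int]]:
    """Regroup the flat stream into 7-token frames for the audio detokenizer."""
    assert len(seq) % TOKENS_PER_FRAME == 0
    return [seq[i:i + TOKENS_PER_FRAME] for i in range(0, len(seq), TOKENS_PER_FRAME)]

frames = [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14]]
flat = flatten_frames(frames)
assert unflatten_frames(flat) == frames  # round-trip recovers the frames
```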

    Training Methodology:
    • Pretrained on massive scale speech and text data to maintain language understanding
    • Text token training boosts TTS performance while preserving semantic reasoning ability
    • Trained exclusively on permissive/non-copyrighted audio data
    • Fine-tuned models available for production use with 8 distinct voices (tara, leah, jess, leo, dan, mia, zac, zoe)
    • Supports custom fine-tuning with as few as 50 examples per speaker

    Performance Characteristics:
    • Handles disfluencies naturally without artifacts
    • Streaming inference faster than real-time playback on A100 40GB for 3B parameter model
    • vLLM implementation enables efficient GPU utilization
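
    "Faster than real-time" can be quantified with a real-time factor (RTF): generation wall-clock time divided by audio duration, where values below 1.0 mean generation stays ahead of playback. The numbers below are made-up placeholders, not benchmark results.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means audio is produced faster than it plays back."""
    return generation_seconds / audio_seconds

# Hypothetical example: 2.5 s of compute to synthesize 10 s of audio
rtf = real_time_factor(2.5, 10.0)
assert rtf == 0.25  # comfortably ahead of real-time playback
```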

  • Applications & use cases

    Conversational AI & Virtual Assistants:
    • Low-latency streaming enables natural conversational experiences
    • Emotional intelligence and empathy expression for human-like interactions
    • Multiple voice options for personalized assistant experiences
    • Handles natural disfluencies and conversational patterns

    Content Creation & Media:
    • Audiobook narration with natural emotion and intonation
    • Podcast generation with multiple speaker voices
    • Video voiceovers with guided emotion control
    • Character voices for gaming and animation

    Enterprise & Production Applications:
    • Contact center automation with empathetic customer service voices
    • E-learning and training content with engaging narration
    • Accessibility applications for text-to-speech needs
    • Real-time translation and dubbing services

    Creative Applications:
    • Guided emotion and intonation for dramatic readings
    • Role-playing and character voice generation
    • Music and audio production with vocal synthesis
    • Interactive storytelling with dynamic voice expressions

Model details
  • Model provider
    Canopy Labs
  • Type
    Audio
  • Main use cases
    Small & Fast
    Text-to-Speech
  • Deployment
    Serverless
  • Parameters
    3B
  • Price

    $0.27 / 1M characters

  • Price

    $0.85 / 1M characters

  • Input modalities
    Text
  • Output modalities
    Audio
  • Released
    March 16, 2025
  • Last updated
    November 2, 2025
  • Category
    Audio