Kokoro-82M

Fast, cost-efficient TTS with quality comparable to larger models

About model

Kokoro is a lightweight text-to-speech (TTS) model with just 82 million parameters. Despite being much smaller than competing models, it delivers comparable speech quality while running faster and costing less to serve. It is released under the Apache-2.0 license and was trained for a total of about $1000, which makes it one of the more accessible production-grade TTS models available.

Parameters

82M

Compact architecture, blazing-fast inference

Per Hour of Audio

$0.06

Market rate API deployment ($1/M characters)
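The per-hour figure can be sanity-checked from the per-character rate. The sketch below assumes a speaking rate of roughly 1,000 characters per minute, which is an illustrative assumption rather than a figure from this page; `cost_per_audio_hour` is a hypothetical helper name.

```python
def cost_per_audio_hour(price_per_million_chars: float,
                        chars_per_minute: float = 1000.0) -> float:
    """Estimate the API cost (USD) of one hour of generated speech.

    chars_per_minute is an assumed speaking rate, not a measured value.
    """
    chars_per_hour = chars_per_minute * 60
    return price_per_million_chars * chars_per_hour / 1_000_000


if __name__ == "__main__":
    # $1 per million characters, as quoted above.
    print(f"${cost_per_audio_hour(1.0):.2f} per hour of audio")
```

Under the same assumed speaking rate, the estimate scales linearly with the per-character price, so a different price can be passed in as the first argument.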

Languages & Voices

8 × 54

Multilingual support in v1.0 release

Model key capabilities
  • Extreme Efficiency: Quality matching larger models at a fraction of the computational cost
  • Truly Open: Apache-2.0 licensed, so it can be deployed in production or personal projects under the terms of that license
  • Accessible Training: Total cost of about $1000 (1000 A100 GPU hours) makes it reproducible for the community
  • Battle-Tested: Deployed in numerous commercial APIs and real-world production environments
  • API usage

    • cURL
    • Python
    • TypeScript

    Endpoint:

    hexgrad/Kokoro-82M

    curl --location 'https://api.together.ai/v1/audio/generations' \
      --header 'Content-Type: application/json' \
      --header "Authorization: Bearer $TOGETHER_API_KEY" \
      --output speech.mp3 \
      --data '{
        "input": "Today is a wonderful day to build something people love!",
        "voice": "helpful woman",
        "response_format": "mp3",
        "sample_rate": 44100,
        "stream": false,
        "model": "hexgrad/Kokoro-82M"
      }'
    
    from together import Together

    # The client reads the API key from the TOGETHER_API_KEY environment variable.
    client = Together()

    speech_file_path = "speech.mp3"

    response = client.audio.speech.create(
        model="hexgrad/Kokoro-82M",
        input="Today is a wonderful day to build something people love!",
        voice="helpful woman",
    )

    # Write the returned audio to disk.
    response.stream_to_file(speech_file_path)
    
    
    import { createWriteStream } from 'fs';
    import { Readable } from 'stream';
    import Together from 'together-ai';

    const together = new Together();

    async function generateAudio() {
      const res = await together.audio.create({
        input: 'Today is a wonderful day to build something people love!',
        voice: 'helpful woman',
        response_format: 'mp3',
        sample_rate: 44100,
        stream: false,
        model: 'hexgrad/Kokoro-82M',
      });

      if (res.body) {
        console.log(res.body);
        // Wrap the response body in a Node.js stream and write it to disk.
        const nodeStream = Readable.from(res.body as ReadableStream);
        const fileStream = createWriteStream('./speech.mp3');

        nodeStream.pipe(fileStream);
      }
    }

    generateAudio();
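
Where the SDK is not available, the cURL call maps directly onto a plain HTTP request. The following is a minimal sketch using only the Python standard library. It reuses the URL, headers, and JSON fields from the cURL example; the helper names `build_speech_request` and `save_speech` and the default argument values are illustrative assumptions, not part of any official client.

```python
import json
import urllib.request

# URL and JSON fields are copied from the cURL example.
API_URL = "https://api.together.ai/v1/audio/generations"


def build_speech_request(text: str, api_key: str,
                         voice: str = "helpful woman") -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the cURL example."""
    payload = {
        "input": text,
        "voice": voice,
        "response_format": "mp3",
        "sample_rate": 44100,
        "stream": False,
        "model": "hexgrad/Kokoro-82M",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def save_speech(request: urllib.request.Request, path: str) -> None:
    """Send the request and write the returned audio bytes to `path`.

    Calling this performs a real, billable API call.
    """
    with urllib.request.urlopen(request) as resp, open(path, "wb") as out:
        out.write(resp.read())
```

A typical call would read the key from the environment, for example `save_speech(build_speech_request("Hello", os.environ["TOGETHER_API_KEY"]), "speech.mp3")` after `import os`. Keeping request construction separate from sending makes the payload easy to inspect or test without network access.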
    
    
  • Model card

    Architecture Overview:
    • Based on StyleTTS 2 architecture with ISTFTNet vocoder
    • 82 million parameter lightweight design optimized for efficiency
    • Decoder-only architecture with no diffusion or encoder
    • Uses misaki G2P (grapheme-to-phoneme) library for text processing
    • Fine-tuned from StyleTTS2-LJSpeech base model

    Training Methodology:
    • Trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels
    • v1.0: A few hundred hours of audio across 8 languages with 54 voices
    • v0.19: Less than 100 hours for initial English-only release with 10 voices
    • Total training cost: $1000 for 1000 hours of A100 80GB GPU time
    • Uses public domain audio, Apache/MIT licensed audio, and synthetic audio from large providers
    • Includes CC BY licensed datasets: Koniwa tnc (<1h, CC BY 3.0) and SIWIS (<11h, CC BY 4.0)
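
The training cost figure implies an effective rate of about $1 per A100 GPU-hour. A small sketch of that arithmetic follows; `cost_per_gpu_hour` is a hypothetical helper, and the inputs are the totals quoted in the list above.

```python
def cost_per_gpu_hour(total_cost_usd: float, gpu_hours: float) -> float:
    """Effective hourly rate implied by a total training cost."""
    if gpu_hours <= 0:
        raise ValueError("gpu_hours must be positive")
    return total_cost_usd / gpu_hours


if __name__ == "__main__":
    # $1000 total for 1000 hours of A100 80GB time.
    print(cost_per_gpu_hour(1000, 1000))  # 1.0
```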

    Performance Characteristics:
    • Delivers comparable quality to larger TTS models despite compact 82M size
    • Significantly faster inference than larger alternatives
    • Deployed in numerous commercial APIs and production projects

  • Applications & use cases

    Production API Services:
    • Deployed in numerous commercial APIs at market-leading prices
    • Cost-effective TTS for high-volume applications (under $1 per million characters)
    • Ideal for startups and businesses needing affordable voice synthesis
    • Apache-2.0 license permits commercial deployment

    Personal & Developer Projects:
    • Lightweight 82M parameters suitable for local deployment
    • Easy integration into applications via simple Python API
    • Perfect for indie developers and hobbyists

    Multilingual Applications:
    • Support for 8 languages with 54 voices in v1.0
    • International content creation and localization
    • Cross-language accessibility solutions
    • Global customer service automation

    Content Creation:
    • Audiobook narration with cost-efficient processing
    • Podcast and video voiceovers
    • E-learning content with multiple language support
    • Social media and marketing content generation

    Accessibility & Assistive Technology:
    • Screen readers and text-to-speech assistive devices
    • Educational tools for language learning
    • Communication aids for speech-impaired users
    • Document reading applications

Related models
  • Model provider
    hexgrad
  • Type
    Audio
  • Main use cases
    Text-to-Speech
  • Deployment
    Serverless
  • Parameters
    82M
  • Price
    $10.00 / 1M characters
  • Input modalities
    Text
  • Output modalities
    Audio
  • Released
    December 25, 2024
  • Last updated
    November 2, 2025
  • Category
    Audio