Kokoro-82M

Fast, cost-efficient TTS with quality comparable to larger models

About model

Kokoro is an ultra-lightweight text-to-speech (TTS) model with just 82 million parameters. Despite being far smaller than competing models, it delivers comparable speech quality while running faster and costing less to serve. With Apache-2.0 licensing and a total training cost of about $1000, it is one of the most accessible production-grade TTS models available.

Parameters: 82M (compact architecture, fast inference)

Per hour of audio: $0.06 (market-rate API deployment at $1 per million characters)

Languages and voices: 8 languages, 54 voices (multilingual support in the v1.0 release)

Model key capabilities

Extreme Efficiency: Quality matching larger models at a fraction of the computational cost
Truly Open: Apache-2.0 licensed, so it can be deployed in production, personal projects, or anywhere else without restrictions
Accessible Training: Total cost of about $1000 (1000 A100 GPU hours) makes it reproducible for the community
Battle-Tested: Deployed in numerous commercial APIs and real-world production environments

Quickstart guides

Audio

Open NotebookLM: PDF to Podcast

API usage

cURL
Python
TypeScript

Endpoint:

hexgrad/Kokoro-82M

curl --location 'https://api.together.ai/v1/audio/generations' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $TOGETHER_API_KEY" \
  --output speech.mp3 \
  --data '{
    "input": "Today is a wonderful day to build something people love!",
    "voice": "helpful woman",
    "response_format": "mp3",
    "sample_rate": 44100,
    "stream": false,
    "model": "hexgrad/Kokoro-82M"
  }'

from together import Together

# The client reads TOGETHER_API_KEY from the environment.
client = Together()

speech_file_path = "speech.mp3"

response = client.audio.speech.create(
    model="hexgrad/Kokoro-82M",
    input="Today is a wonderful day to build something people love!",
    voice="helpful woman",
)

# Write the returned audio to disk.
response.stream_to_file(speech_file_path)

import { createWriteStream } from 'fs';
import { Readable } from 'stream';
import Together from 'together-ai';

// The client reads TOGETHER_API_KEY from the environment.
const together = new Together();

async function generateAudio() {
  const res = await together.audio.create({
    input: 'Today is a wonderful day to build something people love!',
    voice: 'helpful woman',
    response_format: 'mp3',
    sample_rate: 44100,
    stream: false,
    model: 'hexgrad/Kokoro-82M',
  });

  if (res.body) {
    // Convert the response body to a Node.js stream and write it to disk.
    const nodeStream = Readable.from(res.body as ReadableStream);
    const fileStream = createWriteStream('./speech.mp3');

    nodeStream.pipe(fileStream);
  }
}

generateAudio();
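
For environments where installing an SDK is not an option, the cURL request above can also be assembled with Python's standard library. The sketch below is a minimal illustration, not an official client: the helper names `build_speech_request` and `save_speech` are made up for this example, and the payload simply mirrors the fields shown in the cURL call. Building the request is kept separate from sending it so the payload can be inspected without network access.

```python
import json
import os
import urllib.request

# Endpoint and field names are copied from the cURL example on this page.
API_URL = "https://api.together.ai/v1/audio/generations"


def build_speech_request(text: str, api_key: str,
                         voice: str = "helpful woman") -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the cURL example."""
    payload = {
        "model": "hexgrad/Kokoro-82M",
        "input": text,
        "voice": voice,
        "response_format": "mp3",
        "sample_rate": 44100,
        "stream": False,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def save_speech(text: str, path: str = "speech.mp3") -> None:
    """Send the request and write the returned audio bytes to `path`."""
    request = build_speech_request(text, os.environ["TOGETHER_API_KEY"])
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

Calling `save_speech("Hello from Kokoro")` with `TOGETHER_API_KEY` set should produce the same `speech.mp3` as the cURL command, assuming the endpoint behaves as shown above.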

Model card

Architecture Overview:
• Based on StyleTTS 2 architecture with ISTFTNet vocoder
• 82 million parameter lightweight design optimized for efficiency
• Decoder-only architecture with no diffusion or encoder
• Uses misaki G2P (grapheme-to-phoneme) library for text processing
• Fine-tuned from StyleTTS2-LJSpeech base model

Training Methodology:
• Trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels
• v1.0: A few hundred hours of audio across 8 languages and 54 voices
• v0.19: Less than 100 hours for initial English-only release with 10 voices
• Total training cost: $1000 for 1000 hours of A100 80GB GPU time
• Uses public domain audio, Apache/MIT licensed audio, and synthetic audio from large providers
• Includes CC BY licensed datasets: Koniwa tnc (<1h, CC BY 3.0) and SIWIS (<11h, CC BY 4.0)

Performance Characteristics:
• Delivers comparable quality to larger TTS models despite compact 82M size
• Significantly faster inference than larger alternatives
• Deployed in numerous commercial APIs and production projects

Applications & use cases

Production API Services:
• Deployed in numerous commercial APIs at market-leading prices
• Cost-effective TTS for high-volume applications (under $1 per million characters)
• Ideal for startups and businesses needing affordable voice synthesis
• Apache-2.0 license enables unrestricted commercial deployment
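
The per-character and per-hour prices quoted on this page are related by a speaking rate. As a rough sketch: $1 per million characters and $0.06 per hour of audio together imply about 60,000 characters per hour, or roughly 1,000 characters per minute. That rate is an assumption derived from those two figures, not a measured value, and the function names below are illustrative only.

```python
def audio_cost_per_hour(price_per_million_chars: float,
                        chars_per_minute: float = 1000.0) -> float:
    """Estimate the dollar cost of one hour of generated audio.

    chars_per_minute is an assumed average speaking rate; real output
    length depends on the text, voice, and language.
    """
    chars_per_hour = chars_per_minute * 60
    return price_per_million_chars * chars_per_hour / 1_000_000


def text_cost(text: str, price_per_million_chars: float) -> float:
    """Dollar cost of synthesizing `text` at a per-character price."""
    return len(text) * price_per_million_chars / 1_000_000


if __name__ == "__main__":
    # Market rate quoted on this page: $1 per million characters.
    print(f"${audio_cost_per_hour(1.0):.2f} per hour at $1/M characters")
    # Price listed in the model specifications: $10.00 per million characters.
    print(f"${audio_cost_per_hour(10.0):.2f} per hour at $10/M characters")
```

Under the same assumed rate, the $10.00 per million characters listed in the model specifications works out to about ten times the market-rate figure per hour of audio.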

Personal & Developer Projects:
• Lightweight 82M parameters suitable for local deployment
• Easy integration into applications via simple Python API
• Perfect for indie developers and hobbyists

Multilingual Applications:
• Support for 8 languages with 54 voices in v1.0
• International content creation and localization
• Cross-language accessibility solutions
• Global customer service automation

Content Creation:
• Audiobook narration with cost-efficient processing
• Podcast and video voiceovers
• E-learning content with multiple language support
• Social media and marketing content generation
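
Long-form content such as audiobooks usually has to be split into smaller pieces before it is sent for synthesis. The sketch below groups sentences into chunks on sentence boundaries. It is an illustration only: the `max_chars` limit is a hypothetical value chosen for the example, not a documented limit of the model or the API, and `split_into_chunks` is not part of any SDK.

```python
import re


def split_into_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Group sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own
    chunk rather than being cut mid-sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # The next sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed as the `input` of a separate request and the resulting audio files concatenated in order.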

Accessibility & Assistive Technology:
• Screen readers and text-to-speech assistive devices
• Educational tools for language learning
• Communication aids for speech-impaired users
• Document reading applications

Model specifications

Model data

Model provider: hexgrad
Type: Audio
Main use cases: Text-to-Speech
Deployment: Serverless
Endpoint: hexgrad/Kokoro-82M
Parameters: 82M
Price: $10.00 / 1M characters
Input modalities: Text
Output modalities: Audio

Released: December 25, 2024
Last updated: November 2, 2025
External link: Provider docs
Category: Audio
