
Announcing the fastest inference for realtime voice AI agents

November 4, 2025

By Rajas Bansal, Sahil Yadav, Garima Dhanania, Sri Yanamandra, Charles Zedlewski, Zain Hasan, Derek Petersen, Blaine Kasten, Sonny Khan, Rishabh Bhargava

Summary

  • Streaming Whisper speech-to-text (STT): Continuous transcription over WebSocket APIs optimized for voice agents
  • First serverless open-source text-to-speech (TTS): Orpheus (high-fidelity) and Kokoro (ultra-low latency) available through REST and WebSocket APIs without dedicated infrastructure
  • Voxtral transcription and speaker diarization: Premium multilingual transcription model and automatic speaker identification for batch processing

Voice interfaces are one of the hallmarks of a truly AI-native application. From transcription to speech-to-code to outbound calling to custom podcasts, voice makes applications engaging and productive. But developers often have to piece together a number of specialized voice services to ship a single voice application, which slows development while adding complexity, latency, and cost.

We’re pleased to announce a greatly expanded set of high-performance, low-latency voice infrastructure on our cloud. We’ve worked hard to provide voice services that are frontier quality, developer friendly, and very low latency.

With these additions, we’ve expanded our voice offering from transcription to a full set of building blocks that can power some or all of an application’s voice pipeline. These services support real-time and batch patterns in developer-friendly serverless and dedicated form factors.


Streaming speech-to-text for voice agents

Streaming Whisper

Traditional batch transcription waits for complete audio files. Voice agents need to process speech as it arrives, and intelligently detect when users finish speaking.

We've built the industry's fastest speech-to-text API by combining optimized model inference with intelligent system design — WebSocket streaming to eliminate connection overhead, carefully tuned voice activity detection (VAD), and purpose-built infrastructure for realtime audio processing. The result: Whisper running in real time with minimal quality degradation, completing transcripts up to 35% faster than alternatives.

The key is optimizing for time-to-complete-transcript, not just time-to-first-token. Voice agents need to know precisely when a user stops speaking to begin formulating responses. Our VAD tuning ensures your agent responds at the right moment, not too early (cutting users off) or too late (creating dead air).
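The endpointing decision described above can be illustrated with a simple energy-based sketch. This is a hypothetical simplification for intuition only — the production API performs VAD server-side, and the threshold values here are illustrative, not the tuned parameters:

```python
def detect_end_of_speech(frames, energy_threshold=0.01, silence_frames_needed=3):
    """Return the index of the frame where the user stopped speaking,
    or None if speech is still ongoing.

    frames: list of per-frame RMS energy values (floats).
    Tuning silence_frames_needed balances cutting users off (too eager)
    against dead air (too patient).
    """
    silence_run = 0
    speech_seen = False
    for i, energy in enumerate(frames):
        if energy >= energy_threshold:
            speech_seen = True
            silence_run = 0
        elif speech_seen:
            silence_run += 1
            if silence_run >= silence_frames_needed:
                # Endpoint: enough trailing silence after speech was heard
                return i - silence_frames_needed + 1
    return None

# Speech (high energy) followed by sustained silence: endpoint at frame 3
frames = [0.2, 0.3, 0.25, 0.001, 0.002, 0.001, 0.001]
print(detect_end_of_speech(frames))  # → 3
```

A real endpointer works on model-based speech probabilities rather than raw energy, but the tradeoff is the same: a longer required silence run means fewer false endpoints and more dead air.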

STREAMING WHISPER

Real-time transcription with industry-leading latency. Carefully tuned voice activity detection for natural conversation flow.

$0.0035/min

Try now

Text-to-speech: Serverless open-source models

Together AI is the first cloud to provide serverless open-source text-to-speech models. No more spinning up dedicated instances for sporadic TTS needs — both models are available through REST APIs for batch generation and WebSocket APIs for realtime streaming.

Orpheus TTS: Natural voice quality

Orpheus delivers natural, expressive speech with multiple voice options suitable for customer-facing applications. At 187ms time-to-first-byte, it outpaces premium providers while approaching the speed of lighter models. The result: professional voice quality without sacrificing the responsiveness voice agents require.

ORPHEUS TTS

High-fidelity voice generation with natural prosody. 187ms average time-to-first-byte—faster than premium proprietary providers.

$15/1M chars

Try now

Kokoro TTS

When every millisecond counts, Kokoro delivers. With 97ms baseline TTFB, it's built for applications where response speed trumps all else. This predictable performance makes it ideal for high-volume voice agent deployments where cost and latency are critical.

KOKORO TTS

Ultra-fast production-scale voice. 97ms time-to-first-byte—more than 2x faster than alternatives with consistent performance under load.

$4/1M chars

Try now
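The two models trade fidelity against cost: Orpheus at $15/1M characters, Kokoro at $4/1M. A quick back-of-envelope helper makes the difference concrete (the per-turn and volume figures below are assumptions for illustration):

```python
def tts_cost_usd(characters, price_per_million_chars):
    """Estimate TTS spend from character volume and a per-million-character rate."""
    return characters / 1_000_000 * price_per_million_chars

# Hypothetical agent: ~200 chars spoken per turn, 50,000 turns/month = 10M chars
monthly_chars = 200 * 50_000
print(tts_cost_usd(monthly_chars, 15))  # Orpheus: 150.0
print(tts_cost_usd(monthly_chars, 4))   # Kokoro: 40.0
```

At that volume the gap is $110/month — small enough to justify Orpheus for customer-facing calls, large enough to favor Kokoro for high-volume informational traffic.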

New audio transcription capabilities

Two new capabilities expand our audio transcriptions API for batch processing workflows:

Voxtral Mini

Voxtral Mini is a higher-accuracy transcription model from Mistral AI, optimized for European languages and challenging audio conditions. Voxtral delivers measurably lower word error rates than standard Whisper — ideal for applications where transcription mistakes create liability or operational overhead.

VOXTRAL

Premium multilingual transcription optimized for European languages and challenging audio conditions with measurably lower word error rates.

$0.0030/min

Try now

Speaker Diarization

Automatically identify and label different speakers in recorded audio. Transform raw transcripts into structured conversations showing who said what and when — essential for meeting transcription, call center quality assurance, and multi-party conversation review.
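Turning diarized segments into a readable conversation is mostly a grouping step. The sketch below assumes an illustrative segment shape (`speaker`, `start`, `text` keys) — not the exact API response schema:

```python
def format_diarized(segments):
    """Collapse time-ordered diarized segments into 'who said what' turns,
    merging consecutive segments from the same speaker."""
    turns = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        if turns and turns[-1][0] == seg["speaker"]:
            # Same speaker continuing: extend the current turn
            turns[-1] = (seg["speaker"], turns[-1][1] + " " + seg["text"])
        else:
            turns.append((seg["speaker"], seg["text"]))
    return [f"{spk}: {text}" for spk, text in turns]

segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "text": "Hi, thanks for calling."},
    {"speaker": "SPEAKER_01", "start": 2.1, "text": "I need to change"},
    {"speaker": "SPEAKER_01", "start": 3.4, "text": "my flight."},
]
print(format_diarized(segments))
# → ['SPEAKER_00: Hi, thanks for calling.', 'SPEAKER_01: I need to change my flight.']
```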

Built for production voice agents

Three architectural decisions make Together AI's audio infrastructure uniquely suited for production voice agents:

Latency: Response times that enable natural conversation

Human conversation flows at a specific pace. Responses that take longer than 500ms feel unnatural. Beyond 2 seconds, users assume the system has failed. Every additional 100ms of latency measurably decreases user satisfaction and task completion rates.

Our infrastructure eliminates unnecessary latency at every layer. WebSocket connections stay alive, avoiding TCP handshake overhead. Models run on the same GPU clusters as your LLMs, eliminating cross-provider networking. Most critically, our optimized serving delivers consistent sub-200ms TTS and millisecond-accurate transcription even during traffic spikes.
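Putting numbers on that 500ms ceiling: a conversational turn is a sum of sequential stages. The endpoint-detection, LLM, and network figures below are illustrative assumptions; only the 97ms Kokoro TTFB comes from this post:

```python
def turn_latency_ms(stages):
    """Sum per-stage latencies for one conversational turn."""
    return sum(stages.values())

# Illustrative budget for a sub-500ms turn
stages = {
    "endpoint_detection": 150,  # VAD deciding the user has finished (assumed)
    "llm_first_token": 150,     # agent begins formulating a reply (assumed)
    "tts_first_byte": 97,       # Kokoro baseline TTFB (from this post)
    "network": 50,              # round trips (assumed)
}
total = turn_latency_ms(stages)
print(total, total <= 500)  # → 447 True
```

The point of the exercise: every stage eats from a shared budget, so shaving 100ms off any one of them directly widens the margin for the others.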

Real-world impact: When a customer calls to change their flight, every second of delay increases the chance they'll hang up. The voice agent must capture their request, process it, and begin responding — all within the natural rhythm of human conversation.

Quality: Accurate transcription and natural-sounding speech

Voice agents fail when transcription errors cascade through the conversation. A misheard account number becomes a failed lookup. A garbled product name triggers the wrong workflow. Poor voice quality immediately signals "cheap automation" regardless of underlying intelligence.

That's why we offer multiple quality tiers. Streaming Whisper handles realtime transcription with enough accuracy for natural conversation. When precision matters — legal depositions, medical consultations, financial transactions — Voxtral's superior accuracy justifies its premium pricing. On the output side, Orpheus provides the natural, expressive voices users expect from professional services, while Kokoro offers clear, efficient speech for high-volume informational use cases.

Consider a healthcare scheduling bot: It must accurately capture medication names, understand accented speech, and respond with appropriate empathy. Quality failures at any layer break user trust and force expensive human escalation.

Scale: Consistent performance under production load

Voice infrastructure that performs well in demos but fails under production load creates a trust problem. Users who experience degraded service during peak hours learn to avoid the system entirely.

Our infrastructure maintains performance as load scales. A unique optimization in our WebSocket implementation allows multiplexing multiple conversations through single connections — critical for platforms like contact center software handling hundreds of simultaneous calls. Instead of managing thousands of individual WebSocket connections (with associated memory and networking overhead), you can efficiently route multiple isolated audio streams through shared connections.
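The multiplexing idea reduces to routing tagged frames on a shared connection back into per-conversation buffers. The `(stream_id, payload)` framing below is an illustrative sketch, not the actual wire protocol:

```python
def demux(messages):
    """Route interleaved (stream_id, payload) messages from one shared
    connection into isolated per-conversation buffers."""
    streams = {}
    for stream_id, payload in messages:
        streams.setdefault(stream_id, []).append(payload)
    return streams

# Three calls interleaved on a single connection
messages = [("call-1", b"a"), ("call-2", b"b"), ("call-1", b"c"), ("call-3", b"d")]
print(demux(messages))
# → {'call-1': [b'a', b'c'], 'call-2': [b'b'], 'call-3': [b'd']}
```

Each logical stream stays isolated while the process holds one socket instead of hundreds, which is where the memory and networking savings come from.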

This same approach to scale applies across our stack. Geographic distribution ensures low latency regardless of user location. Automatic scaling handles traffic spikes without manual intervention. The result: voice agents that perform identically whether handling 10 or 10,000 concurrent conversations.


Try it now

    from together import Together
    import asyncio

    client = Together()

    async def handle_conversation(audio_stream):
        # Listen to user input. `audio_stream` is a placeholder for your
        # audio source (e.g. an open file or microphone stream).
        transcription = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_stream,
            language="en"
        )

        # Generate a response while monitoring for interruptions
        tts = client.audio.speech.create(
            model="canopylabs/orpheus-3b-0.1-ft",
            input="I can help with that-",
            voice="tara"
        )

        # `play_audio` is a placeholder for your playback routine
        audio_playback = asyncio.create_task(play_audio(tts))

        # Simultaneously monitor for user interruptions. (With the streaming
        # WebSocket API, transcript chunks arrive continuously as the user speaks.)
        async for chunk in transcription:
            if "wait" in chunk.text.lower() or "actually" in chunk.text.lower():
                audio_playback.cancel()
                # Adapt to the new user input
                break

Get started:

For production deployments, contact our sales team for enterprise options and dedicated infrastructure.


