
Voxtral-Mini-3B-2507 API

Multimodal audio understanding with transcription and reasoning


This model is not currently supported on Together AI.

Visit our Models page to view all the latest models.

Voxtral-Mini is an enhancement of Ministral 3B that incorporates state-of-the-art audio input capabilities while retaining best-in-class text performance. This compact 3B-parameter model excels at speech transcription, translation, and audio understanding, delivering leading ASR performance across the world's most widely spoken languages.

• 40 min Audio Understanding: 32K context for long-form audio processing
• 8 Languages Supported: Automatic language detection & transcription
• 3B Compact Parameters: Efficient audio + text in small model size
Why Voxtral-Mini?
• Unified Audio-Text Model: No separate ASR pipeline needed—direct Q&A, summarization, and function calling from voice
• Production-Ready Transcription: Dedicated mode with automatic language detection across 8 major languages
• Long-Form Audio: Handle up to 40 minutes of audio for understanding tasks with 32K token context
• Retains Text Excellence: Built on Ministral-3B backbone—maintains full text reasoning capabilities

Voxtral-Mini-3B-2507 API Usage

Endpoint

curl -X POST "https://api.together.xyz/v1/chat/completions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Voxtral-Mini-3B-2507",
    "messages": [
      {
        "role": "user",
        "content": "What are some fun things to do in New York?"
      }
    ]
}'
curl -X POST "https://api.together.xyz/v1/images/generations" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Voxtral-Mini-3B-2507",
    "prompt": "Draw an anime style version of this image.",
    "width": 1024,
    "height": 768,
    "steps": 28,
    "n": 1,
    "response_format": "url",
    "image_url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"
  }'
curl -X POST https://api.together.xyz/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -d '{
    "model": "mistralai/Voxtral-Mini-3B-2507",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe what you see in this image."},
        {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"}}
      ]
    }],
    "max_tokens": 512
  }'
curl -X POST https://api.together.xyz/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -d '{
    "model": "mistralai/Voxtral-Mini-3B-2507",
    "messages": [{
      "role": "user",
      "content": "Given two binary strings `a` and `b`, return their sum as a binary string"
    }]
  }'
curl -X POST https://api.together.xyz/v1/rerank \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -d '{
    "model": "mistralai/Voxtral-Mini-3B-2507",
    "query": "What animals can I find near Peru?",
    "documents": [
      "The giant panda (Ailuropoda melanoleuca), also known as the panda bear or simply panda, is a bear species endemic to China.",
      "The llama is a domesticated South American camelid, widely used as a meat and pack animal by Andean cultures since the pre-Columbian era.",
      "The wild Bactrian camel (Camelus ferus) is an endangered species of camel endemic to Northwest China and southwestern Mongolia.",
      "The guanaco is a camelid native to South America, closely related to the llama. Guanacos are one of two wild South American camelids; the other species is the vicuña, which lives at higher elevations."
    ],
    "top_n": 2
  }'
curl -X POST https://api.together.xyz/v1/embeddings \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Our solar system orbits the Milky Way galaxy at about 515,000 mph.",
    "model": "mistralai/Voxtral-Mini-3B-2507"
  }'
curl -X POST https://api.together.xyz/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    "prompt": "A horse is a horse",
    "max_tokens": 32,
    "temperature": 0.1,
    "safety_model": "mistralai/Voxtral-Mini-3B-2507"
  }'
curl --location 'https://api.together.ai/v1/audio/generations' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $TOGETHER_API_KEY" \
  --output speech.mp3 \
  --data '{
    "input": "Today is a wonderful day to build something people love!",
    "voice": "helpful woman",
    "response_format": "mp3",
    "sample_rate": 44100,
    "stream": false,
    "model": "mistralai/Voxtral-Mini-3B-2507"
  }'
curl -X POST "https://api.together.xyz/v1/audio/transcriptions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -F "model=mistralai/Voxtral-Mini-3B-2507" \
  -F "language=en" \
  -F "response_format=json" \
  -F "timestamp_granularities=segment"
curl --request POST \
  --url https://api.together.xyz/v2/videos \
  --header "Authorization: Bearer $TOGETHER_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "mistralai/Voxtral-Mini-3B-2507",
    "prompt": "some penguins building a snowman"
  }'
curl --request POST \
  --url https://api.together.xyz/v2/videos \
  --header "Authorization: Bearer $TOGETHER_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "mistralai/Voxtral-Mini-3B-2507",
    "frame_images": [{"input_image": "https://cdn.pixabay.com/photo/2020/05/20/08/27/cat-5195431_1280.jpg"}]
  }'

from together import Together

client = Together()

response = client.chat.completions.create(
  model="mistralai/Voxtral-Mini-3B-2507",
  messages=[
    {
      "role": "user",
      "content": "What are some fun things to do in New York?"
    }
  ]
)
print(response.choices[0].message.content)
from together import Together

client = Together()

imageCompletion = client.images.generate(
    model="mistralai/Voxtral-Mini-3B-2507",
    width=1024,
    height=768,
    steps=28,
    prompt="Draw an anime style version of this image.",
    image_url="https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png",
)

print(imageCompletion.data[0].url)


from together import Together

client = Together()

response = client.chat.completions.create(
    model="mistralai/Voxtral-Mini-3B-2507",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what you see in this image."},
            {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"}}
        ]
    }]
)
print(response.choices[0].message.content)

from together import Together

client = Together()
response = client.chat.completions.create(
  model="mistralai/Voxtral-Mini-3B-2507",
  messages=[
    {
      "role": "user",
      "content": "Given two binary strings `a` and `b`, return their sum as a binary string"
    }
  ],
)

print(response.choices[0].message.content)

from together import Together

client = Together()

query = "What animals can I find near Peru?"

documents = [
  "The giant panda (Ailuropoda melanoleuca), also known as the panda bear or simply panda, is a bear species endemic to China.",
  "The llama is a domesticated South American camelid, widely used as a meat and pack animal by Andean cultures since the pre-Columbian era.",
  "The wild Bactrian camel (Camelus ferus) is an endangered species of camel endemic to Northwest China and southwestern Mongolia.",
  "The guanaco is a camelid native to South America, closely related to the llama. Guanacos are one of two wild South American camelids; the other species is the vicuña, which lives at higher elevations.",
]

response = client.rerank.create(
  model="mistralai/Voxtral-Mini-3B-2507",
  query=query,
  documents=documents,
  top_n=2
)

for result in response.results:
    print(f"Relevance Score: {result.relevance_score}")

from together import Together

client = Together()

response = client.embeddings.create(
  model = "mistralai/Voxtral-Mini-3B-2507",
  input = "Our solar system orbits the Milky Way galaxy at about 515,000 mph"
)

from together import Together

client = Together()

response = client.completions.create(
  model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
  prompt="A horse is a horse",
  max_tokens=32,
  temperature=0.1,
  safety_model="mistralai/Voxtral-Mini-3B-2507",
)

print(response.choices[0].text)

from together import Together

client = Together()

speech_file_path = "speech.mp3"

response = client.audio.speech.create(
  model="mistralai/Voxtral-Mini-3B-2507",
  input="Today is a wonderful day to build something people love!",
  voice="helpful woman",
)
    
response.stream_to_file(speech_file_path)

from together import Together

client = Together()
response = client.audio.transcribe(
    file="audio.mp3",  # placeholder path to a local audio file
    model="mistralai/Voxtral-Mini-3B-2507",
    language="en",
    response_format="json",
    timestamp_granularities="segment"
)
print(response.text)
from together import Together

client = Together()

# Create a video generation job
job = client.videos.create(
    prompt="A serene sunset over the ocean with gentle waves",
    model="mistralai/Voxtral-Mini-3B-2507"
)
from together import Together

client = Together()

job = client.videos.create(
    model="mistralai/Voxtral-Mini-3B-2507",
    frame_images=[
        {
            "input_image": "https://cdn.pixabay.com/photo/2020/05/20/08/27/cat-5195431_1280.jpg",
        }
    ]
)
import Together from 'together-ai';
const together = new Together();

const completion = await together.chat.completions.create({
  model: 'mistralai/Voxtral-Mini-3B-2507',
  messages: [
    {
      role: 'user',
      content: 'What are some fun things to do in New York?'
     }
  ],
});

console.log(completion.choices[0].message.content);
import Together from "together-ai";

const together = new Together();

async function main() {
  const response = await together.images.create({
    model: "mistralai/Voxtral-Mini-3B-2507",
    width: 1024,
    height: 1024,
    steps: 28,
    prompt: "Draw an anime style version of this image.",
    image_url: "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png",
  });

  console.log(response.data[0].url);
}

main();

import Together from "together-ai";

const together = new Together();
const imageUrl = "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png";

async function main() {
  const response = await together.chat.completions.create({
    model: "mistralai/Voxtral-Mini-3B-2507",
    messages: [{
      role: "user",
      content: [
        { type: "text", text: "Describe what you see in this image." },
        { type: "image_url", image_url: { url: imageUrl } }
      ]
    }]
  });
  
  console.log(response.choices[0]?.message?.content);
}

main();

import Together from "together-ai";

const together = new Together();

async function main() {
  const response = await together.chat.completions.create({
    model: "mistralai/Voxtral-Mini-3B-2507",
    messages: [{
      role: "user",
      content: "Given two binary strings `a` and `b`, return their sum as a binary string"
    }]
  });
  
  console.log(response.choices[0]?.message?.content);
}

main();

import Together from "together-ai";

const together = new Together();

const query = "What animals can I find near Peru?";
const documents = [
  "The giant panda (Ailuropoda melanoleuca), also known as the panda bear or simply panda, is a bear species endemic to China.",
  "The llama is a domesticated South American camelid, widely used as a meat and pack animal by Andean cultures since the pre-Columbian era.",
  "The wild Bactrian camel (Camelus ferus) is an endangered species of camel endemic to Northwest China and southwestern Mongolia.",
  "The guanaco is a camelid native to South America, closely related to the llama. Guanacos are one of two wild South American camelids; the other species is the vicuña, which lives at higher elevations."
];

async function main() {
  const response = await together.rerank.create({
    model: "mistralai/Voxtral-Mini-3B-2507",
    query: query,
    documents: documents,
    top_n: 2
  });
  
  for (const result of response.results) {
    console.log(`Relevance Score: ${result.relevance_score}`);
  }
}

main();


import Together from "together-ai";

const together = new Together();

const response = await together.embeddings.create({
  model: 'mistralai/Voxtral-Mini-3B-2507',
  input: 'Our solar system orbits the Milky Way galaxy at about 515,000 mph',
});

import Together from "together-ai";

const together = new Together();

async function main() {
  const response = await together.completions.create({
    model: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    prompt: "A horse is a horse",
    max_tokens: 32,
    temperature: 0.1,
    safety_model: "mistralai/Voxtral-Mini-3B-2507"
  });
  
  console.log(response.choices[0]?.text);
}

main();

import Together from 'together-ai';
import { Readable } from 'node:stream';
import { createWriteStream } from 'node:fs';

const together = new Together();

async function generateAudio() {
   const res = await together.audio.create({
    input: 'Today is a wonderful day to build something people love!',
    voice: 'helpful woman',
    response_format: 'mp3',
    sample_rate: 44100,
    stream: false,
    model: 'mistralai/Voxtral-Mini-3B-2507',
  });

  if (res.body) {
    console.log(res.body);
    const nodeStream = Readable.from(res.body as ReadableStream);
    const fileStream = createWriteStream('./speech.mp3');

    nodeStream.pipe(fileStream);
  }
}

generateAudio();

import Together from "together-ai";

const together = new Together();

const response = await together.audio.transcriptions.create({
  file: "audio.mp3", // placeholder path or URL to an audio file
  model: "mistralai/Voxtral-Mini-3B-2507",
  language: "en",
  response_format: "json",
  timestamp_granularities: "segment"
});
console.log(response)
import Together from "together-ai";

const together = new Together();

async function main() {
  // Create a video generation job
  const job = await together.videos.create({
    prompt: "A serene sunset over the ocean with gentle waves",
    model: "mistralai/Voxtral-Mini-3B-2507"
  });
  console.log(job);
}

main();
import Together from "together-ai";

const together = new Together();

const job = await together.videos.create({
  model: "mistralai/Voxtral-Mini-3B-2507",
  frame_images: [
    {
      input_image: "https://cdn.pixabay.com/photo/2020/05/20/08/27/cat-5195431_1280.jpg",
    }
  ]
});

How to use Voxtral-Mini-3B-2507

Model details

Architecture Overview:
• Built on Ministral-3B backbone with added audio encoder capabilities
• 32K token context length, enabling up to 30 minutes of audio for transcription and 40 minutes for understanding tasks
• Dual-mode operation: a dedicated transcription mode and a full audio understanding mode (see the sketch below)
• Native multimodal architecture combining audio and text processing
• Automatic language detection across 8 major world languages
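In practice the two modes map to two different endpoints. The sketch below uses the Together Python SDK, reusing the client.audio.transcribe call from the usage examples above for transcription mode; the input_audio content part used for understanding mode is an assumed, OpenAI-style schema rather than a documented one, so check the current API reference before relying on it.

import base64
from together import Together

client = Together()

# Transcription mode: dedicated ASR endpoint, returns the transcript text.
transcript = client.audio.transcribe(
    file="meeting.mp3",  # placeholder path to a local recording
    model="mistralai/Voxtral-Mini-3B-2507",
    response_format="json",
)
print(transcript.text)

# Understanding mode: ask questions about the same audio via chat completions,
# with no separate ASR step. NOTE: the "input_audio" content part below is an
# assumed, OpenAI-style schema, not confirmed for this endpoint.
with open("meeting.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

summary = client.chat.completions.create(
    model="mistralai/Voxtral-Mini-3B-2507",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the key decisions in this recording."},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "mp3"}},
        ],
    }],
)
print(summary.choices[0].message.content)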

Training Methodology:
• Enhanced from Ministral-3B language model with state-of-the-art audio capabilities
• Trained for leading performance on speech transcription benchmarks
• Optimized for multilingual understanding across English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian
• Retains full text understanding capabilities from base Ministral-3B model
• Supports function-calling directly from voice input

Performance Characteristics:
• State-of-the-art Word Error Rate (WER) across FLEURS, Mozilla Common Voice, and Multilingual LibriSpeech benchmarks
• Competitive with specialized ASR models while adding audio understanding capabilities
• Outperforms models like Whisper large-v3, GPT-4o-mini-transcribe, and Scribe across multiple languages
• Dedicated transcription mode maximizes ASR performance
• Built-in Q&A and summarization without separate ASR preprocessing
• GPU requirements: ~9.5 GB GPU RAM in bf16 or fp16
• Compact 3B parameter size for efficient deployment

Prompting Voxtral-Mini-3B-2507
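There is no special prompt template for audio: a user turn carries the audio plus an optional text instruction, and follow-up questions are ordinary text turns that resend the conversation history. The sketch below is a hedged illustration of a multi-turn exchange, again assuming the OpenAI-style input_audio content part described in the model details above.

from together import Together

client = Together()
MODEL = "mistralai/Voxtral-Mini-3B-2507"

# First turn: audio plus a text instruction (the audio content part is an assumed schema).
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Transcribe the speaker's main argument in one paragraph."},
        {"type": "input_audio", "input_audio": {"data": "<base64-encoded audio>", "format": "mp3"}},
    ],
}]
first = client.chat.completions.create(model=MODEL, messages=messages)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Follow-up turns are plain text; the audio turn stays in the history, so the
# model can keep answering questions about the same recording.
messages.append({"role": "user", "content": "What counter-arguments would you expect to that argument?"})
follow_up = client.chat.completions.create(model=MODEL, messages=messages)
print(follow_up.choices[0].message.content)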

Applications & Use Cases

Speech Transcription & ASR:
• State-of-the-art automatic speech recognition across 8 languages
• Automatic language detection—no manual language specification needed (see the sketch after this list)
• Long-form transcription up to 30 minutes of continuous audio
• Production-ready dedicated transcription mode for maximum accuracy
• Ideal for meeting transcription, podcast transcription, and content creation
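To rely on automatic language detection, simply omit the language field from the transcription request. The sketch below mirrors the transcription example in the usage section; the filename is a placeholder.

from together import Together

client = Together()

# No "language" argument: the model detects the spoken language automatically.
response = client.audio.transcribe(
    file="interview.mp3",  # placeholder path; any of the 8 supported languages works
    model="mistralai/Voxtral-Mini-3B-2507",
    response_format="json",
)
print(response.text)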

Voice Assistants & Conversational AI:
• Direct Q&A from audio without separate ASR pipeline
• Function-calling straight from voice for backend integration (see the tool-calling sketch after this list)
• Multi-turn audio conversations with context retention
• API calls triggered directly by spoken user intents
• Perfect for voice-enabled applications and smart assistants
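Tool definitions follow the standard OpenAI-compatible tools schema on the chat completions endpoint. The sketch below is a hedged example that assumes tool calling is enabled for this model on your endpoint and uses a hypothetical get_weather function; in a voice pipeline the user turn would carry an audio content part instead of the text shown here.

import json
from together import Together

client = Together()

# Hypothetical backend function exposed to the model as a tool.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="mistralai/Voxtral-Mini-3B-2507",
    messages=[{"role": "user", "content": "What's the weather like in Lisbon right now?"}],
    tools=tools,
)

# If the model decided to call the tool, the arguments arrive as a JSON string.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))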

Audio Understanding & Analysis:
• Built-in summarization of audio content up to 40 minutes
• Extract insights and answer questions about audio without transcription step
• Structured summary generation from voice recordings
• Meeting analysis, lecture comprehension, podcast summarization
• Content moderation and audio classification

Multilingual Applications:
• 8 language support: English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian
• Automatic language detection eliminates manual configuration
• Cross-language transcription and translation capabilities
• Global customer service and support automation
• International content localization and accessibility

Enterprise & Production:
• Customer service call analysis and quality monitoring
• Voice-activated workflows and business process automation
• Accessibility solutions for hearing-impaired users
• Media and broadcasting transcription services
• Legal and medical audio documentation

Developer & Integration:
• Unified model eliminates complex ASR + LLM pipelines
• Direct audio-to-action workflows via function calling
• Retains full text capabilities for hybrid applications
• Easy deployment with vLLM and Transformers support (see the self-hosting sketch after this list)
• Compact 3B size enables cost-effective scaling
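For self-hosting, a common pattern is to serve the model behind an OpenAI-compatible server (for example vLLM's) and reuse any OpenAI-style client. The sketch below assumes such a server is already running on localhost:8000 and serving this model ID; launch flags and audio support depend on your vLLM version, so consult the vLLM documentation and the model card.

from openai import OpenAI

# Point an OpenAI-compatible client at the local server instead of Together's API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Voxtral-Mini-3B-2507",
    messages=[{"role": "user", "content": "Give a two-sentence summary of what you can do."}],
)
print(response.choices[0].message.content)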

Looking for production scale? Deploy on a dedicated endpoint

Deploy Voxtral-Mini-3B-2507 on a dedicated endpoint with custom hardware configuration, as many instances as you need, and auto-scaling.

Get started