Voxtral-Mini-3B-2507 API
Multimodal audio understanding with transcription and reasoning

This model is not currently supported on Together AI.
Visit our Models page to view all the latest models.
Voxtral-Mini-3B-2507 API Usage
Endpoint
How to use Voxtral-Mini-3B-2507
Model details
Architecture Overview:
• Built on the Ministral-3B backbone with an added audio encoder
• 32K-token context length, supporting up to 30 minutes of audio for transcription or 40 minutes for audio understanding
• Dual-mode operation: dedicated transcription mode and full audio understanding mode
• Native multimodal architecture combining audio and text processing
• Automatic language detection across 8 major world languages
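A quick back-of-the-envelope check shows how the 32K context maps to those audio durations. The 12.5 tokens-per-second rate below is an assumption about the encoder's downsampled output, used purely for illustration:

```python
# Assumption: the audio encoder emits roughly 12.5 tokens per second
# of audio after downsampling; treat this rate as illustrative.
TOKENS_PER_SECOND = 12.5
CONTEXT_TOKENS = 32_000

def audio_tokens(minutes: float) -> int:
    """Approximate context tokens consumed by `minutes` of audio."""
    return int(minutes * 60 * TOKENS_PER_SECOND)

print(audio_tokens(30))  # 22500 -- leaves ~9.5K tokens for the transcript
print(audio_tokens(40))  # 30000 -- close to the 32K ceiling
```

Under this assumption, 40 minutes of audio nearly fills the window, which is consistent with the shorter limit for transcription, where the output transcript must also fit.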
Training Methodology:
• Extends the Ministral-3B language model with state-of-the-art audio capabilities
• Trained for leading performance on speech transcription benchmarks
• Optimized for multilingual understanding across English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian
• Retains full text understanding capabilities from base Ministral-3B model
• Supports function-calling directly from voice input
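Function calling from voice works like text-based function calling: the model emits a structured tool call that your code executes. The sketch below uses the common OpenAI-style tool schema; the tool name (`get_weather`) and its parameters are hypothetical examples, not from the Voxtral documentation:

```python
import json

# Hypothetical tool definition in the OpenAI-style function-calling
# format; "get_weather" and "city" are illustrative names only.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

def parse_tool_call(tool_call: dict) -> tuple:
    """Extract the function name and parsed arguments from a tool call."""
    fn = tool_call["function"]
    return fn["name"], json.loads(fn["arguments"])

# Simulated model response to a spoken "What's the weather in Paris?"
name, args = parse_tool_call(
    {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
)
```

The point of voice-native function calling is that the spoken request goes straight to the tool call, with no intermediate transcription step for your application to manage.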
Performance Characteristics:
• State-of-the-art Word Error Rate (WER) across FLEURS, Mozilla Common Voice, and Multilingual LibriSpeech benchmarks
• Competitive with specialized ASR models while adding audio understanding capabilities
• Outperforms models like Whisper large-v3, GPT-4o-mini-transcribe, and Scribe across multiple languages
• Dedicated transcription mode maximizes ASR performance
• Built-in Q&A and summarization without separate ASR preprocessing
• GPU requirements: ~9.5 GB GPU RAM in bf16 or fp16
• Compact 3B parameter size for efficient deployment
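The ~9.5 GB figure can be sanity-checked with simple arithmetic, assuming weight storage dominates at 2 bytes per parameter in bf16/fp16:

```python
# Rough arithmetic behind the ~9.5 GB requirement (assumption: weights
# dominate; the remainder covers the audio encoder, KV cache, and
# activation buffers).
PARAMS = 3.0e9        # approximate parameter count
BYTES_PER_PARAM = 2   # bf16 / fp16
weights_gib = PARAMS * BYTES_PER_PARAM / 1024**3
print(f"{weights_gib:.1f} GiB")  # ~5.6 GiB for the weights alone
```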
Prompting Voxtral-Mini-3B-2507
Applications & Use Cases
Speech Transcription & ASR:
• State-of-the-art automatic speech recognition across 8 languages
• Automatic language detection—no manual language specification needed
• Long-form transcription of up to 30 minutes of continuous audio
• Production-ready dedicated transcription mode for maximum accuracy
• Ideal for meeting transcription, podcast transcription, and content creation
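Recordings longer than the 30-minute limit need to be split before transcription. A minimal sketch of the chunk-planning step, using naive fixed windows (a production pipeline would cut on silence to avoid splitting words):

```python
# Split a long recording into windows the dedicated transcription mode
# can handle (<= 30 minutes each). Offsets are in seconds; boundaries
# here are naive fixed cuts, purely illustrative.
MAX_CHUNK_S = 30 * 60

def plan_chunks(total_s: float, max_chunk_s: float = MAX_CHUNK_S):
    """Return (start, end) second offsets covering the whole file."""
    chunks, start = [], 0.0
    while start < total_s:
        end = min(start + max_chunk_s, total_s)
        chunks.append((start, end))
        start = end
    return chunks

print(plan_chunks(75 * 60))  # a 75-minute file becomes three windows
```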
Voice Assistants & Conversational AI:
• Direct Q&A from audio without separate ASR pipeline
• Function-calling straight from voice for backend integration
• Multi-turn audio conversations with context retention
• API calls triggered directly by spoken user intents
• Perfect for voice-enabled applications and smart assistants
Audio Understanding & Analysis:
• Built-in summarization of audio content up to 40 minutes
• Extract insights and answer questions about audio without transcription step
• Structured summary generation from voice recordings
• Meeting analysis, lecture comprehension, podcast summarization
• Content moderation and audio classification
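Because the model accepts audio directly, a summarization request is a single chat call rather than an ASR-then-LLM pipeline. The sketch below builds an OpenAI-style multimodal payload; the `input_audio` content-part shape is an assumption borrowed from OpenAI-compatible servers such as vLLM, so verify it against your server's schema:

```python
import base64

# Assumed request shape for an OpenAI-compatible chat endpoint; the
# "input_audio" content part is not guaranteed by the Voxtral docs.
def summarization_request(audio_bytes: bytes, fmt: str = "wav") -> dict:
    audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": "mistralai/Voxtral-Mini-3B-2507",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": fmt}},
                {"type": "text",
                 "text": "Summarize this recording: topics, decisions, "
                         "and action items as bullet points."},
            ],
        }],
    }

req = summarization_request(b"RIFF....WAVE")  # placeholder bytes, not real audio
```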
Multilingual Applications:
• 8 language support: English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian
• Automatic language detection eliminates manual configuration
• Cross-language transcription and translation capabilities
• Global customer service and support automation
• International content localization and accessibility
Enterprise & Production:
• Customer service call analysis and quality monitoring
• Voice-activated workflows and business process automation
• Accessibility solutions for hearing-impaired users
• Media and broadcasting transcription services
• Legal and medical audio documentation
Developer & Integration:
• Unified model eliminates complex ASR + LLM pipelines
• Direct audio-to-action workflows via function calling
• Retains full text capabilities for hybrid applications
• Easy deployment with vLLM and Transformers support
• Compact 3B size enables cost-effective scaling
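For self-hosting, a deployment sketch with vLLM is shown below. The flags mirror those commonly used for Mistral-format checkpoints; verify them against your vLLM version before relying on this:

```shell
# Deployment sketch (flags are assumptions based on vLLM's handling of
# Mistral-format checkpoints; check your vLLM version's documentation).
pip install -U "vllm[audio]"
vllm serve mistralai/Voxtral-Mini-3B-2507 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral
```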
