Voxtral-Mini-3B-2507
Multimodal audio understanding with transcription and reasoning
About model
Voxtral-Mini is an enhancement of Ministral 3B that incorporates state-of-the-art audio input capabilities while retaining best-in-class text performance. This compact 3B parameter model excels at speech transcription, translation, and audio understanding—delivering best-in-class ASR performance across the world's most widely spoken languages.
40min
32K context for long-form audio processing
8
Automatic language detection & transcription
3B
Efficient audio + text in small model size
- Unified Audio-Text Model: No separate ASR pipeline needed—direct Q&A, summarization, and function calling on audio
- Production-Ready Transcription: Dedicated mode with automatic language detection across 8 major languages
- Long-Form Audio: Handle up to 40 minutes of audio for understanding tasks with 32K token context
- Retains Text Excellence: Built on Ministral-3B backbone—maintains full text reasoning capabilities alongside audio
API usage
Endpoint:
Model card
Architecture Overview:
• Built on Ministral-3B backbone with added audio encoder capabilities
• 32K token context length enabling up to 30 minutes for transcription, 40 minutes for understanding
• Dual-mode operation: dedicated transcription mode and full audio understanding mode
• Native multimodal architecture combining audio and text processing
• Automatic language detection across 8 major world languages
Training Methodology:
• Enhanced from Ministral-3B language model with state-of-the-art audio capabilities
• Trained for leading performance on speech transcription benchmarks
• Optimized for multilingual understanding across English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian
• Retains full text understanding capabilities from base Ministral-3B model
• Supports function-calling directly from voice input
Performance Characteristics:
• State-of-the-art Word Error Rate (WER) across FLEURS, Mozilla Common Voice, and Multilingual LibriSpeech benchmarks
• Competitive with specialized ASR models while adding audio understanding capabilities
• Outperforms models like Whisper large-v3, GPT-4o-mini-transcribe, and Scribe across multiple languages
• Dedicated transcription mode maximizes ASR performance
• Built-in Q&A and summarization without separate ASR preprocessing
• Compact 3B parameter size for efficient deployment
Applications & use cases
Speech Transcription & ASR:
• State-of-the-art automatic speech recognition across 8 languages
• Automatic language detection—no manual language specification needed
• Long-form transcription up to 30 minutes of continuous audio
• Production-ready dedicated transcription mode for maximum accuracy
• Ideal for meeting transcription, podcast transcription, and content creation
Voice Assistants & Conversational AI:
• Direct Q&A from audio without separate ASR pipeline
• Function-calling straight from voice for backend integration
• Multi-turn audio conversations with context retention
• API calls triggered directly by spoken user intents
• Perfect for voice-enabled applications and smart assistants
Audio Understanding & Analysis:
• Built-in summarization of audio content up to 40 minutes
• Extract insights and answer questions about audio without transcription step
• Structured summary generation from voice recordings
• Meeting analysis, lecture comprehension, podcast summarization
• Content moderation and audio classification
Multilingual Applications:
• 8 language support: English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian
• Automatic language detection eliminates manual configuration
• Cross-language transcription and translation capabilities
• Global customer service and support automation
• International content localization and accessibility
Enterprise & Production:
• Customer service call analysis and quality monitoring
• Voice-activated workflows and business process automation
• Accessibility solutions for hearing-impaired users
• Media and broadcasting transcription services
• Legal and medical audio documentation
Developer & Integration:
• Unified model eliminates complex ASR + LLM pipelines
• Direct audio-to-action workflows via function calling
• Retains full text capabilities for hybrid applications
• Easy deployment with vLLM and Transformers support
• Compact 3B size enables cost-effective scaling
- TypeAudio
- Main use casesSmall & FastSpeech-to-Text
- DeploymentServerless
- Endpoint
- Parameters3B
- Context length32K
- Input modalitiesAudio
- Output modalitiesText
- ReleasedJuly 1, 2025
- Last updatedNovember 2, 2025
- External link
- CategoryTranscribe