Audio

Voxtral-Mini-3B-2507

Multimodal audio understanding with transcription and reasoning

About model

Voxtral-Mini is an enhancement of Ministral 3B that incorporates state-of-the-art audio input capabilities while retaining best-in-class text performance. This compact 3B-parameter model excels at speech transcription, translation, and audio understanding, delivering state-of-the-art ASR performance across the world's most widely spoken languages.

Audio Understanding

40 min

32K context for long-form audio processing

Languages Supported

8

Automatic language detection & transcription

Compact Parameters

3B

Efficient audio + text in small model size

Model key capabilities
  • Unified Audio-Text Model: No separate ASR pipeline needed—direct Q&A, summarization, and function calling on audio
  • Production-Ready Transcription: Dedicated mode with automatic language detection across 8 major languages
  • Long-Form Audio: Handle up to 40 minutes of audio for understanding tasks with 32K token context
  • Retains Text Excellence: Built on Ministral-3B backbone—maintains full text reasoning capabilities alongside audio
  • API usage

    • cURL
    • Python
    • TypeScript

    Endpoint:

    mistralai/Voxtral-Mini-3B-2507

    curl -X POST "https://api.together.xyz/v1/audio/transcriptions" \
      -H "Authorization: Bearer $TOGETHER_API_KEY" \
      -F "file=@audio.mp3" \
      -F "model=mistralai/Voxtral-Mini-3B-2507" \
      -F "language=en" \
      -F "response_format=json" \
      -F "timestamp_granularities=segment"
    
    from together import Together
    
    client = Together()
    response = client.audio.transcribe(
        file="audio.mp3",
        model="mistralai/Voxtral-Mini-3B-2507",
        language="en",
        response_format="json",
        timestamp_granularities="segment"
    )
    print(response.text)
    
    import Together from "together-ai";
    
    const together = new Together();
    
    const response = await together.audio.transcriptions.create({
      file: "audio.mp3",
      model: "mistralai/Voxtral-Mini-3B-2507",
      language: "en",
      response_format: "json",
      timestamp_granularities: "segment"
    });
    console.log(response);
    
  • Model card

    Architecture Overview:
    • Built on Ministral-3B backbone with added audio encoder capabilities
    • 32K token context length enabling up to 30 minutes of audio for transcription and 40 minutes for understanding
    • Dual-mode operation: dedicated transcription mode and full audio understanding mode
    • Native multimodal architecture combining audio and text processing
    • Automatic language detection across 8 major world languages

    Training Methodology:
    • Enhanced from Ministral-3B language model with state-of-the-art audio capabilities
    • Trained for leading performance on speech transcription benchmarks
    • Optimized for multilingual understanding across English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian
    • Retains full text understanding capabilities from base Ministral-3B model
    • Supports function-calling directly from voice input

    Performance Characteristics:
    • State-of-the-art Word Error Rate (WER) across FLEURS, Mozilla Common Voice, and Multilingual LibriSpeech benchmarks
    • Competitive with specialized ASR models while adding audio understanding capabilities
    • Outperforms models like Whisper large-v3, GPT-4o-mini-transcribe, and Scribe across multiple languages
    • Dedicated transcription mode maximizes ASR performance
    • Built-in Q&A and summarization without separate ASR preprocessing
    • Compact 3B parameter size for efficient deployment
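The built-in Q&A and summarization mode pairs an audio input with a text prompt in a single chat-completions request, with no separate ASR preprocessing. A minimal sketch of what such a request payload could look like, assuming an OpenAI-style `input_audio` content part (the exact content-part schema is an assumption; check the provider's API reference):

```python
# Sketch: building a chat-completions payload that asks a question about
# an audio clip directly, without a separate transcription step.
# The "input_audio" content-part shape is an assumed, OpenAI-style format.

def build_audio_question(audio_b64: str, question: str) -> dict:
    """Pair base64-encoded audio with a text question in one user message."""
    return {
        "model": "mistralai/Voxtral-Mini-3B-2507",
        "messages": [
            {
                "role": "user",
                "content": [
                    # The audio clip itself (base64-encoded WAV here).
                    {"type": "input_audio",
                     "input_audio": {"data": audio_b64, "format": "wav"}},
                    # The question the model should answer about the audio.
                    {"type": "text", "text": question},
                ],
            }
        ],
    }

payload = build_audio_question("<base64-audio>", "Summarize the key points.")
```

The same payload shape would serve summarization, Q&A, or classification over clips up to the 40-minute understanding limit; only the text prompt changes.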

  • Applications & use cases

    Speech Transcription & ASR:
    • State-of-the-art automatic speech recognition across 8 languages
    • Automatic language detection—no manual language specification needed
    • Long-form transcription up to 30 minutes of continuous audio
    • Production-ready dedicated transcription mode for maximum accuracy
    • Ideal for meeting transcription, podcast transcription, and content creation
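The automatic language detection above can be sketched as a request builder: omitting the `language` form field leaves detection to the model, while passing it pins the transcription language. The field names mirror the request shown under "API usage"; beyond that they are assumptions, not a documented contract.

```python
from typing import Optional

# Sketch: form fields for the dedicated transcription endpoint.
# Omitting "language" triggers automatic language detection.

def transcription_request(file_path: str,
                          language: Optional[str] = None) -> dict:
    """Build form fields for POST /v1/audio/transcriptions."""
    fields = {
        "file": file_path,
        "model": "mistralai/Voxtral-Mini-3B-2507",
        "response_format": "json",
        "timestamp_granularities": "segment",
    }
    if language is not None:
        # Only pin the language when the caller knows it; otherwise
        # let the model auto-detect among the 8 supported languages.
        fields["language"] = language
    return fields

auto = transcription_request("meeting.mp3")          # auto-detect
pinned = transcription_request("meeting.mp3", "de")  # force German
```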

    Voice Assistants & Conversational AI:
    • Direct Q&A from audio without separate ASR pipeline
    • Function-calling straight from voice for backend integration
    • Multi-turn audio conversations with context retention
    • API calls triggered directly by spoken user intents
    • Perfect for voice-enabled applications and smart assistants
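As a rough sketch of function calling driven by a spoken request: the model receives the audio plus a tool schema, and can emit a tool call for the backend to execute. The `get_weather` tool and the `input_audio` content-part shape are illustrative assumptions; the tool schema follows the common OpenAI-style `tools` format.

```python
# Hypothetical sketch: letting the model trigger a backend function
# directly from a voice message, with no intermediate transcript handling.

def build_voice_tool_call(audio_b64: str) -> dict:
    """Payload asking the model to pick a tool based on spoken intent."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical backend function
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": "mistralai/Voxtral-Mini-3B-2507",
        "tools": [weather_tool],
        "messages": [{
            "role": "user",
            # The user's request arrives as audio only, e.g. a clip of
            # someone asking "What's the weather in Paris?"
            "content": [{"type": "input_audio",
                         "input_audio": {"data": audio_b64,
                                         "format": "wav"}}],
        }],
    }
```

On a matching spoken request, the response would carry a `get_weather` tool call whose arguments the backend executes, closing the audio-to-action loop.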

    Audio Understanding & Analysis:
    • Built-in summarization of audio content up to 40 minutes
    • Extract insights and answer questions about audio without transcription step
    • Structured summary generation from voice recordings
    • Meeting analysis, lecture comprehension, podcast summarization
    • Content moderation and audio classification

    Multilingual Applications:
    • 8 language support: English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian
    • Automatic language detection eliminates manual configuration
    • Cross-language transcription and translation capabilities
    • Global customer service and support automation
    • International content localization and accessibility

    Enterprise & Production:
    • Customer service call analysis and quality monitoring
    • Voice-activated workflows and business process automation
    • Accessibility solutions for hearing-impaired users
    • Media and broadcasting transcription services
    • Legal and medical audio documentation

    Developer & Integration:
    • Unified model eliminates complex ASR + LLM pipelines
    • Direct audio-to-action workflows via function calling
    • Retains full text capabilities for hybrid applications
    • Easy deployment with vLLM and Transformers support
    • Compact 3B size enables cost-effective scaling

Related models
  • Model provider
    Mistral AI
  • Type
    Audio
  • Main use cases
    Small & Fast
    Speech-to-Text
  • Deployment
    Serverless
  • Parameters
    3B
  • Context length
    32K
  • Input modalities
    Audio
  • Output modalities
    Text
  • Released
    July 1, 2025
  • Last updated
    November 2, 2025
  • Category
    Transcribe