Qwen3-VL-32B-Instruct API
Powerful vision-language model with visual agent capabilities and extended context

This model is not currently supported on Together AI.
Visit our Models page to view all the latest models.
Qwen3-VL-32B-Instruct API Usage
Endpoint
How to use Qwen3-VL-32B-Instruct
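Qwen3-VL-32B-Instruct is not currently served on Together AI, so the snippet below is only a hypothetical sketch of how the model would be called if it were listed. It assumes the model ID "Qwen/Qwen3-VL-32B-Instruct" and follows the request shape of Together's OpenAI-compatible chat completions API (https://api.together.xyz/v1/chat/completions) for vision-language models.

# Hypothetical sketch: this model is not currently listed on Together AI,
# so the model ID below is an assumption.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",  # assumed identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart and extract its key figures."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)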
Model details
Architecture Overview:
• 33B parameter vision-language model with native 256K context, expandable to 1 million tokens for extended reasoning.
• Interleaved-MRoPE: Full-frequency allocation over time, width, and height for enhanced long-horizon video reasoning.
• DeepStack: Fuses multi-level ViT features to capture fine-grained visual details and sharpen image-text alignment.
• Text-Timestamp Alignment: Precise, timestamp-grounded event localization for stronger video temporal modeling.
• Supports flash_attention_2 for faster inference and lower memory use in multi-image and video scenarios (see the loading sketch below).
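For local experimentation, the flash_attention_2 path noted in the last bullet is enabled at load time. Below is a minimal loading sketch with Hugging Face transformers; the repository ID and the use of the generic AutoModelForImageTextToText and AutoProcessor classes are assumptions, so treat the official model card as authoritative.

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Assumed Hugging Face repository ID for this checkpoint.
MODEL_ID = "Qwen/Qwen3-VL-32B-Instruct"

# flash_attention_2 reduces attention memory and latency over the long visual
# token sequences produced by multi-image and video inputs.
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

Generation then follows the usual processor-then-generate flow; the flash_attention_2 setting matters most for long-context, multi-image, and video inputs.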
Training Methodology:
• Comprehensive multimodal pre-training with broader, higher-quality datasets enabling "recognize everything" capability.
• Enhanced training for visual recognition: celebrities, anime, products, landmarks, flora, fauna, and more.
• Expanded OCR training supporting 32 languages with robustness in low light, blur, tilt, rare characters, and jargon.
• Long-document structure parsing improvements for better understanding of complex document layouts.
Performance Characteristics:
• Strong multimodal reasoning: 64.8% MMMU, 93.3% DocVQA, 86.9% OCRBench, 88.4% AI2D.
• Advanced document understanding: 93.3% DocVQA, 94.0% ChartQA, 61.4% HallusionBench.
• Visual perception: 31.6% MathVision, 63.9% RealWorldQA, 88.4% AI2D.
• Video understanding: 41.9% VideoMMU with hours-long video processing capability.
• Strong text performance: 86.4% MMLU, 78.6% MMLU Pro, 68.9% GPQA, 70.2% BFCL v3.
• Visual agent capabilities: 36.4% OSWorld for GUI operation and task completion.
Prompting Qwen3-VL-32B-Instruct
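No official prompt template is reproduced here; the sketch below simply illustrates one common way to interleave multiple images with text in an OpenAI-compatible message list, which suits comparison-style prompts. The image URLs and wording are placeholders, not a documented format.

# Hypothetical interleaved multi-image prompt in OpenAI-compatible message format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two dashboard screenshots and list the UI changes."},
            {"type": "image_url", "image_url": {"url": "https://example.com/dashboard_v1.png"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/dashboard_v2.png"}},
            {"type": "text", "text": "Answer as a bullet list and note which screenshot each change comes from."},
        ],
    }
]

The same structure extends to additional images or sampled video frames.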
Applications & Use Cases
Visual Coding:
• Generating Draw.io diagrams from natural language or visual inputs.
• Creating HTML/CSS/JS code from screenshots, mockups, or video walkthroughs.
• Visual-to-code conversion for rapid prototyping and development.
Spatial Reasoning & Embodied AI:
• Advanced spatial perception: judging object positions, viewpoints, and occlusions.
• 2D grounding for object detection and localization in images.
• 3D grounding for spatial reasoning in robotics and embodied AI applications.
Video Understanding:
• Processing hours-long video with 256K-1M context for full recall.
• Second-level indexing for precise temporal event localization.
• Video summarization, analysis, and question answering across extended durations.
Document Processing & OCR:
• Multi-language OCR supporting 32 languages with robustness to challenging conditions.
• Long-document structure parsing and understanding.
• Document question answering: 93.3% DocVQA, 94.0% ChartQA performance.
Multimodal Reasoning:
• STEM and mathematical reasoning from visual inputs.
• Causal analysis and logical, evidence-based answers combining vision and text.
• Visual question answering across diverse domains with "recognize everything" capability.
General Vision-Language Tasks:
• Celebrity, anime, product, landmark, flora, and fauna recognition.
• Chart understanding and data visualization interpretation.
• Image captioning, visual reasoning, and multi-image understanding.
