
Qwen3-VL-32B-Instruct

Powerful vision-language model with visual agent capabilities and extended context

About model

Qwen3-VL-32B-Instruct is the 32B instruction-tuned model in Qwen3-VL, the most powerful vision-language series in the Qwen family to date. This generation delivers comprehensive upgrades: superior text understanding and generation, deeper visual perception and reasoning, a 256K native context (expandable to 1M), enhanced spatial and video dynamics comprehension, and stronger visual agent capabilities for operating PC and mobile GUIs.
  • Native Context: 256K (expandable to 1M for hours-long video)
  • DocVQA: 93.3% (advanced document understanding)
  • OCR Languages: 32 (expanded from 19, with robust OCR)
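
For orientation, the minimal sketch below queries the model through an OpenAI-compatible chat endpoint with one image and one question. The base URL, model id, and API key variable are placeholders rather than values taken from this page; substitute whatever your deployment uses.

    # Minimal sketch: one image plus a text question via an OpenAI-compatible API.
    # base_url, the model id, and the API_KEY variable are hypothetical placeholders.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.example.com/v1",   # placeholder endpoint
        api_key=os.environ["API_KEY"],           # placeholder credential
    )

    response = client.chat.completions.create(
        model="qwen3-vl-32b-instruct",           # assumed model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text",
                 "text": "Summarize the trend shown in this chart."},
            ],
        }],
        max_tokens=512,
    )
    print(response.choices[0].message.content)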

Model key capabilities
  • Advanced Spatial Perception: 2D/3D grounding for spatial reasoning and embodied AI
  • Long Video Understanding: Hours-long video with full recall and second-level indexing
  • Visual Coding: Generates Draw.io/HTML/CSS/JS from images and videos

Model card

    Architecture Overview:
    • 33B parameter vision-language model with native 256K context, expandable to 1 million tokens for extended reasoning.
    • Interleaved-MRoPE: Full-frequency allocation over time, width, and height for enhanced long-horizon video reasoning.
    • DeepStack: Fuses multi-level ViT features to capture fine-grained visual details and sharpen image-text alignment.
    • Text-Timestamp Alignment: Precise, timestamp-grounded event localization for stronger video temporal modeling.
    • Supports flash_attention_2 for better acceleration and memory efficiency in multi-image and video scenarios.
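
    A minimal loading sketch for the flash_attention_2 path, assuming the checkpoint is published as Qwen/Qwen3-VL-32B-Instruct on Hugging Face and that the installed transformers release includes Qwen3-VL support:

      # Hedged sketch: local loading with FlashAttention-2 enabled.
      # Requires the flash-attn package and a GPU with bf16 support.
      import torch
      from transformers import AutoModelForImageTextToText, AutoProcessor

      model_id = "Qwen/Qwen3-VL-32B-Instruct"   # assumed repository id
      model = AutoModelForImageTextToText.from_pretrained(
          model_id,
          torch_dtype=torch.bfloat16,
          attn_implementation="flash_attention_2",
          device_map="auto",                    # needs accelerate installed
      )
      processor = AutoProcessor.from_pretrained(model_id)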

    Training Methodology:
    • Comprehensive multimodal pre-training with broader, higher-quality datasets enabling "recognize everything" capability.
    • Enhanced training for visual recognition: celebrities, anime, products, landmarks, flora, fauna, and more.
    • Expanded OCR training supporting 32 languages with robustness in low light, blur, tilt, rare characters, and jargon.
    • Long-document structure parsing improvements for better understanding of complex document layouts.

    Performance Characteristics:
    • Strong multimodal reasoning: 64.8% MMMU, 93.3% DocVQA, 86.9% OCRBench, 88.4% AI2D.
    • Advanced document understanding: 93.3% DocVQA, 94.0% ChartQA, 61.4% HallusionBench.
    • Visual perception: 31.6% MathVision, 63.9% RealWorldQA, 88.4% AI2D.
    • Video understanding: 41.9% VideoMMU with hours-long video processing capability.
    • Strong text performance: 86.4% MMLU, 78.6% MMLU Pro, 68.9% GPQA, 70.2% BFCL v3.
    • Visual agent capabilities: 36.4% OSWorld for GUI operation and task completion.

  • Applications & use cases


    • Generating Draw.io diagrams from natural language or visual inputs.
    • Creating HTML/CSS/JS code from screenshots, mockups, or video walkthroughs.
    • Visual-to-code conversion for rapid prototyping and development.
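
    A hedged visual-to-code sketch: a local screenshot is base64-encoded and sent with a request for a single-file HTML reproduction. The client is the OpenAI-compatible one from the earlier sketch; the model id is an assumption.

      import base64

      def screenshot_to_html(client, image_path: str) -> str:
          # Encode the local screenshot as a data URL the chat API accepts.
          with open(image_path, "rb") as f:
              b64 = base64.b64encode(f.read()).decode()
          resp = client.chat.completions.create(
              model="qwen3-vl-32b-instruct",   # assumed model id
              messages=[{
                  "role": "user",
                  "content": [
                      {"type": "image_url",
                       "image_url": {"url": f"data:image/png;base64,{b64}"}},
                      {"type": "text",
                       "text": "Reproduce this UI as one self-contained HTML file "
                               "with inline CSS. Return only the code."},
                  ],
              }],
              max_tokens=2048,
          )
          return resp.choices[0].message.content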

    Spatial Reasoning & Embodied AI:
    • Advanced spatial perception: judging object positions, viewpoints, and occlusions.
    • 2D grounding for object detection and localization in images.
    • 3D grounding for spatial reasoning in robotics and embodied AI applications.
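
    In practice, 2D grounding can be exercised through a plain prompt that asks for JSON bounding boxes, as in the sketch below. The field names and the coordinate convention (pixel vs. normalized) are assumptions to verify against the model documentation.

      import json

      def ground_objects(client, image_url: str, target: str):
          prompt = (
              f"Detect every {target} in the image. Respond with only a JSON list, "
              'each item of the form {"label": str, "bbox_2d": [x1, y1, x2, y2]}.'
          )
          resp = client.chat.completions.create(
              model="qwen3-vl-32b-instruct",   # assumed model id
              messages=[{
                  "role": "user",
                  "content": [
                      {"type": "image_url", "image_url": {"url": image_url}},
                      {"type": "text", "text": prompt},
                  ],
              }],
          )
          text = resp.choices[0].message.content
          # Real outputs may be wrapped in code fences; keep only the JSON list.
          payload = text[text.index("["): text.rindex("]") + 1]
          return json.loads(payload)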

    Video Understanding:
    • Processing hours-long video with 256K-1M context for full recall.
    • Second-level indexing for precise temporal event localization.
    • Video summarization, analysis, and question answering across extended durations.
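
    Where the serving stack does not accept video files directly, long footage can still be queried by sampling frames and sending them as a multi-image message. The sketch below (uniform sampling via OpenCV, the OpenAI-compatible client from earlier, assumed model id) is a lowest-common-denominator approach and does not use the model's native video or timestamp interface.

      import base64
      import cv2  # opencv-python

      def sample_frames(video_path: str, num_frames: int = 16):
          # Uniformly sample frames and return them as base64-encoded JPEGs.
          cap = cv2.VideoCapture(video_path)
          total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
          frames = []
          for i in range(num_frames):
              cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
              ok, frame = cap.read()
              if not ok:
                  continue
              ok, buf = cv2.imencode(".jpg", frame)
              if ok:
                  frames.append(base64.b64encode(buf.tobytes()).decode())
          cap.release()
          return frames

      def ask_about_video(client, video_path: str, question: str):
          content = [
              {"type": "image_url",
               "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
              for b64 in sample_frames(video_path)
          ]
          content.append({"type": "text", "text": question})
          resp = client.chat.completions.create(
              model="qwen3-vl-32b-instruct",   # assumed model id
              messages=[{"role": "user", "content": content}],
          )
          return resp.choices[0].message.content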

    Document Processing & OCR:
    • Multi-language OCR supporting 32 languages with robustness to challenging conditions.
    • Long-document structure parsing and understanding.
    • Document question answering: 93.3% DocVQA, 94.0% ChartQA performance.
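
    One way to exercise OCR and document parsing is to request structured output directly. The sketch below reuses the OpenAI-compatible client pattern; the model id and the JSON field names are illustrative assumptions, not a schema defined by the model.

      INVOICE_PROMPT = (
          "Read this document. Return JSON with the keys "
          '"vendor", "date", "currency", "total", and "line_items" '
          "(a list of {description, quantity, unit_price}). "
          "Transcribe text exactly as printed, in its original language."
      )

      def extract_invoice(client, image_url: str):
          resp = client.chat.completions.create(
              model="qwen3-vl-32b-instruct",   # assumed model id
              messages=[{
                  "role": "user",
                  "content": [
                      {"type": "image_url", "image_url": {"url": image_url}},
                      {"type": "text", "text": INVOICE_PROMPT},
                  ],
              }],
          )
          return resp.choices[0].message.content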

    Multimodal Reasoning:
    • STEM and mathematical reasoning from visual inputs.
    • Causal analysis and logical, evidence-based answers combining vision and text.
    • Visual question answering across diverse domains with "recognize everything" capability.

    General Vision-Language Tasks:
    • Celebrity, anime, product, landmark, flora, and fauna recognition.
    • Chart understanding and data visualization interpretation.
    • Image captioning, visual reasoning, and multi-image understanding.

Model details
  • Model provider
    Qwen
  • Type
    Vision
  • Main use cases
    Small & Fast
    Function Calling
    Vision
  • Features
    Function Calling
  • Fine tuning
    Supported
  • Speed
    High
  • Intelligence
    High
  • Deployment
    Monthly Reserved
  • Parameters
    33B
  • Context length
    256K
  • Input price

    $0.50 / 1M tokens

  • Output price

    $1.50 / 1M tokens

  • Input modalities
    Text
    Image
  • Output modalities
    Text
  • Released
    October 19, 2025
  • Last updated
    February 24, 2026
  • Quantization level
    BF16
  • Category
    Vision