
Qwen3.5-397B-A17B

Native multimodal model with efficient hybrid architecture for global deployment.

About model

Qwen3.5 is a native vision-language foundation model with 397B total parameters (17B activated) built on a sparse MoE architecture. Through early-fusion training on multimodal tokens, Qwen3.5 achieves cross-generational parity with Qwen3 while outperforming Qwen3-VL models across reasoning, coding, agent, and visual-understanding benchmarks. Its hybrid design combines Gated Delta Networks with sparse Mixture-of-Experts, delivering 8.6-19x faster decoding than Qwen3-Max with minimal latency overhead. Reinforcement learning scaled across million-agent environments provides robust real-world adaptability, and expanded support for 201 languages and dialects enables inclusive worldwide deployment on Together AI's production infrastructure.

Total Parameters: 397B (17B activated), native multimodal MoE architecture
Faster Decoding: 8.6-19x vs Qwen3-Max, hybrid Gated Delta Networks
Languages & Dialects: 201, global deployment with cultural understanding

Model key capabilities
  • Native Multimodal Foundation: Early fusion training across text, images, and video (87.8% MMLU-Pro, 88.6% MathVision, 87.5% VideoMME)
  • Hybrid Efficiency: Gated Delta Networks + sparse MoE delivering 8.6-19x faster decoding with 17B activated parameters
  • Global Multilingual: 201 languages and dialects (88.5% MMMLU, 78.9% WMT24++ across 55 languages)
  • Production-Ready Infrastructure: 99.9% SLA, available on serverless and dedicated infrastructure

API usage

    Endpoint: Qwen/Qwen3.5-397B-A17B

    cURL:

    curl -X POST https://api.together.xyz/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $TOGETHER_API_KEY" \
      -d '{
        "model": "Qwen/Qwen3.5-397B-A17B",
        "messages": [{
          "role": "user",
          "content": "Given two binary strings `a` and `b`, return their sum as a binary string"
        }]
      }'
    
    Python:

    from together import Together

    client = Together()

    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-397B-A17B",
        messages=[
            {
                "role": "user",
                "content": "Given two binary strings `a` and `b`, return their sum as a binary string"
            }
        ],
    )

    print(response.choices[0].message.content)
    
    
    TypeScript:

    import Together from "together-ai";
    
    const together = new Together();
    
    async function main() {
      const response = await together.chat.completions.create({
        model: "Qwen/Qwen3.5-397B-A17B",
        messages: [{
          role: "user",
          content: "Given two binary strings `a` and `b`, return their sum as a binary string"
        }]
      });
      
      console.log(response.choices[0]?.message?.content);
    }
    
    main();
    
    
Model card

    Architecture Overview:
    • Native vision-language foundation model with early fusion training on multimodal tokens
    • 397B total parameters with 17B activated per forward pass via sparse MoE routing
    • 512 experts with 10 routed + 1 shared expert activated per token
    • Hybrid architecture: Gated Delta Networks + sparse Mixture-of-Experts
    • 60 layers with 15 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)) layout (sketched after this list)
    • Gated DeltaNet: 64 linear attention heads for V, 16 for QK, head dimension 128
    • Gated Attention: 32 heads for Q, 2 for KV, head dimension 256, RoPE dimension 64
    • 248,320 token vocabulary (padded) with 15-25% lower token counts on technical datasets
    • 256K native context (262,144 tokens), extensible to 1M tokens via YaRN scaling
    • High-resolution vision: up to 1344x1344 pixels, UI screenshots with pixel-perfect element detection
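
    The stated layer budget is easy to sanity-check. A minimal sketch using only the figures from this list (15 blocks of three Gated DeltaNet layers plus one Gated Attention layer, each paired with an MoE FFN; 10 routed + 1 shared of 512 experts per token); the layer names are descriptive labels, not the model's actual module names:

    # Sanity-check of the published 60-layer hybrid layout; labels are
    # illustrative, not the model's real module names.
    layers = []
    for block in range(15):
        layers += [("gated_deltanet", "moe")] * 3   # linear-attention layers
        layers += [("gated_attention", "moe")]      # full-attention layer

    assert len(layers) == 60                        # matches the stated depth

    # Per-token expert activation from the list above: 10 routed + 1 shared.
    routed, shared, total_experts = 10, 1, 512
    print(f"{len(layers)} layers; {routed + shared}/{total_experts} experts active per token")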

    Training Methodology:
    • Early fusion multimodal training achieving near-100% efficiency vs text-only training
    • Reinforcement learning scaled across million-agent environments with progressively complex task distributions
    • Asynchronous RL frameworks supporting massive-scale agent scaffolds and environment orchestration
    • Trained for robust real-world adaptability across reasoning, coding, agents, and visual understanding
    • 201 languages and dialects with nuanced cultural and regional understanding
    • Multi-token prediction (MTP) training for enhanced inference efficiency

    Performance Characteristics:
    • Reasoning Excellence: 87.8% MMLU-Pro, 91.3% AIME26, 83.6% LiveCodeBench v6, 88.4% GPQA
    • Coding Leadership: 76.4% SWE-Bench Verified, 69.3% SWE-Bench Multilingual, 68.3% SecCodeBench
    • Agentic Performance: 69.0/78.6% BrowseComp, 74.0% WideSearch, 72.9% BFCL-V4, 86.7% TAU2-Bench
    • Multilingual SOTA: 88.5% MMMLU, 84.7% MMLU-ProX (29 languages), 78.9% WMT24++ (55 languages)
    • Vision Language: 88.6% MathVision, 90.3% MathVista, 85.0% MMMU, 79.0% MMMU-Pro
    • Video Understanding: 87.5% VideoMME (w/ sub), 84.7% VideoMMMU, 86.7% MLVU
    • Document Understanding: 90.8% OmniDocBench1.5, 93.1% OCRBench, 82.0% CC-OCR
    • Efficiency: 8.6-19x faster decoding vs Qwen3-Max at 32K-256K context lengths

Applications & use cases

    Native Multimodal Reasoning:
    • Vision-language foundation with early fusion training across text, image, and video
    • 87.8% MMLU-Pro, 91.3% AIME26, 83.6% LiveCodeBench v6, 88.4% GPQA Diamond
    • Math vision: 88.6% MathVision, 90.3% MathVista (mini), 87.9% We-Math
    • High-resolution image understanding: up to 1344x1344 pixels, UI element detection (request format sketched after this list)
    • Video understanding: 87.5% VideoMME (w/ subtitles), 84.7% VideoMMMU, 86.7% MLVU
    • Document processing: 90.8% OmniDocBench1.5, 93.1% OCRBench, 82.0% CC-OCR
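
    A minimal sketch of an image request, assuming this model accepts the standard OpenAI-style image_url content part on Together's chat completions endpoint; the image URL is a placeholder:

    from together import Together

    client = Together()

    # Text plus an image content part; the image URL is a placeholder.
    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-397B-A17B",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart and extract its key figures."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }],
    )

    print(response.choices[0].message.content)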

    SOTA Coding & Agentic Performance:
    • Autonomous coding: 76.4% SWE-Bench Verified, 69.3% SWE-Bench Multilingual, 68.3% SecCodeBench
    • Agentic workflows: 69.0/78.6% BrowseComp, 74.0% WideSearch, 86.7% TAU2-Bench
    • Tool orchestration: 72.9% BFCL-V4, 38.3% Tool Decathlon, 46.1% MCP-Mark
    • Visual agents: 65.6% ScreenSpot Pro, 62.2% OSWorld-Verified, 66.8% AndroidWorld
    • Search with tools: 48.3% HLE w/ tool, 70.3% BrowseComp-zh

    Global Multilingual Deployment:
    • 201 languages and dialects with nuanced cultural and regional understanding
    • Multilingual excellence: 88.5% MMMLU, 84.7% MMLU-ProX (29 languages)
    • Translation quality: 78.9% WMT24++ across 55 languages using XCOMET-XXL
    • Cross-lingual reasoning: 59.1% NOVA-63, 85.6% INCLUDE, 89.8% Global PIQA
    • Multilingual math: 73.3% PolyMATH, 88.2% MAXIFE across 23 settings
    • Instruction following: 76.5% IFBench, 67.6% MultiChallenge across languages

    Production Efficiency at Scale:
    • 8.6-19x faster decoding than Qwen3-Max at 32K-256K context lengths
    • Hybrid Gated Delta Networks architecture with minimal latency overhead
    • 397B total parameters with only 17B activated per token via sparse MoE
    • 256K native context (262,144 tokens), extensible to 1M via YaRN RoPE scaling (config sketch after this list)
    • Multi-token prediction (MTP) for enhanced inference throughput
    • 248,320 token vocabulary reducing token counts 15-25% on technical datasets
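
    For self-hosted deployments, earlier Qwen releases extend context with a YaRN rope_scaling entry in the Hugging Face config. A minimal sketch patterned on those releases; the keys and scaling factor for Qwen3.5 are assumptions, so check the model card before relying on them:

    # Hypothetical rope_scaling entry patterned on earlier Qwen releases;
    # keys and values for Qwen3.5 are assumptions, not confirmed settings.
    rope_scaling = {
        "rope_type": "yarn",
        "factor": 4.0,                              # 262,144 x 4 ≈ 1M-token target
        "original_max_position_embeddings": 262144,
    }
    print(rope_scaling)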

    Visual Agentic Workflows:
    • Desktop screenshot understanding with UI element identification and workflow planning
    • Pixel-perfect element detection for UI automation and testing
    • Executable action generation for autonomous task completion
    • Native tool calling for web search, code execution, and API orchestration (sketch after this list)
    • Visual agents: 65.6% ScreenSpot Pro, 62.2% OSWorld-Verified, 66.8% AndroidWorld
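
    Tool calling uses the OpenAI-style tools parameter on the chat completions endpoint. A minimal sketch; get_weather is a made-up illustrative tool, and this model's exact tool-call behavior is an assumption:

    from together import Together

    client = Together()

    # OpenAI-style tool definition; get_weather is a made-up example tool.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-397B-A17B",
        messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
        tools=tools,
    )

    # If the model chose to call the tool, the structured call is on the message.
    message = response.choices[0].message
    if message.tool_calls:
        print(message.tool_calls[0].function.name, message.tool_calls[0].function.arguments)
    else:
        print(message.content)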

    Long-Context Knowledge Work:
    • 256K native context supporting entire codebases, video transcripts, technical reports
    • Extensible to 1M tokens for processing hour-scale videos and massive documents
    • Long-context benchmarks: 63.2% LongBench v2, 68.7% AA-LCR
    • Document understanding: 90.8% OmniDocBench1.5 with long-form comprehension
    • Eliminates chunking for large-scale technical documentation and research (single-request sketch below)
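
    Because long inputs fit in one request, whole documents can be sent without a chunking pipeline. A minimal sketch; report.txt is a placeholder path, and very large files should still be checked against the 256K-token budget:

    from together import Together

    client = Together()

    # Send an entire report in a single request; no chunking pipeline needed.
    # "report.txt" is a placeholder path.
    with open("report.txt", encoding="utf-8") as f:
        document = f.read()

    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-397B-A17B",
        messages=[{
            "role": "user",
            "content": f"Summarize the key findings of this report:\n\n{document}",
        }],
    )

    print(response.choices[0].message.content)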

    Thinking Mode for Complex Reasoning:
    • Default thinking mode generates step-by-step reasoning before final responses
    • Enhanced performance on math competitions (91.3% AIME26), coding challenges (83.6% LiveCodeBench)
    • Can be disabled for direct responses in conversational or low-latency applications
    • Configurable via API parameters without model changes (toggle sketch after this list)
    • Optimal for complex problem-solving, mathematical proofs, algorithm design
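
    Earlier Qwen3 releases expose a soft switch for this: appending /no_think to the user turn disables the reasoning trace. A minimal sketch assuming Qwen3.5 keeps that switch; it may instead use a dedicated API parameter, so check the model card:

    from together import Together

    client = Together()

    # /no_think soft switch from earlier Qwen3 releases; whether Qwen3.5
    # honors it is an assumption, not a confirmed behavior.
    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-397B-A17B",
        messages=[{"role": "user", "content": "What is the capital of Portugal? /no_think"}],
    )

    print(response.choices[0].message.content)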

Model specifications
  • Model provider
    Qwen
  • Type
    Code
    Chat
  • Main use cases
    Function Calling
    Vision
  • Features
    Function Calling
    JSON Mode
  • Speed
    Medium
  • Intelligence
    Very High
  • Deployment
    Serverless
    Monthly Reserved
  • Parameters
    403.4B
  • Context length
    256K
  • Input price
    $0.60 / 1M tokens
  • Output price
    $3.60 / 1M tokens (cost example at the end of this list)
  • Input modalities
    Text
    Image
  • Output modalities
    Text
  • Released
    February 15, 2026
  • Last updated
    February 15, 2026
  • Quantization level
    FP4
  • Category
    Code
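
A quick cost check against the listed prices: at $0.60 per 1M input tokens and $3.60 per 1M output tokens, a request with 10,000 input and 2,000 output tokens costs $0.006 + $0.0072 = $0.0132. The sketch below restates that arithmetic; the token counts are made-up example values:

    # Cost check against the listed serverless prices; token counts are
    # made-up example values.
    INPUT_PRICE_PER_M = 0.60    # USD per 1M input tokens
    OUTPUT_PRICE_PER_M = 3.60   # USD per 1M output tokens

    input_tokens, output_tokens = 10_000, 2_000
    cost = (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
    print(f"${cost:.4f}")       # $0.0132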