Models / Qwen

Qwen

Deploy Qwen3 and QwQ models on Together AI. Hybrid reasoning, agentic coding, and OpenAI-compatible API — open source under Apache 2.0.

Why Qwen on Together AI?

Designed for production workloads that need 
consistent performance and operational control.

Drop-in OpenAI replacement

Same API format, hybrid thinking mode, and multilingual support. Migrate from OpenAI with zero code changes.

From edge to frontier, one family

Models spanning sub-1B to 480B+ parameters with adaptive scaling for every use case and budget.

Open source, enterprise licensed

Apache 2.0 licensing gives you full commercial freedom. SOC 2 Type II certified, HIPAA compliant, US-based infrastructure.

Meet the Qwen family

Explore top-performing models across text, image, video, code, and voice.

New

Chat

Qwen3 8B Base

new

Code

Qwen3.5-397B-A17B

new

Code

Qwen3-Coder-Next

Chat

Qwen3 235B A22B Instruct 2507 FP8

Coming Soon

Image

Qwen Image Edit

Code

Qwen3-Coder 480B A35B Instruct

new

Vision

Qwen3-VL-32B-Instruct

Code

Qwen2.5 Coder 32B Instruct

Chat

Qwen QwQ-32B

Chat

Qwen3-Next-80B-A3B-Instruct

New

Chat

Qwen3-Next-80B-A3B-Thinking

New

Image

Qwen Image

Chat

Qwen3 235B A22B Thinking 2507 FP8

Vision

Qwen2.5-VL 72B Instruct

Chat

Qwen3 235B A22B FP8 Throughput

Chat

Qwen2.5 72B

Chat

Qwen2.5 7B Instruct Turbo

New

Chat

Qwen3 32B

New

Chat

Qwen3 0.6B

New

Chat

Qwen3 0.6B Base

New

Chat

Qwen3 1.7B

New

Chat

Qwen3 1.7B Base

New

Chat

Qwen3 14B Base

New

Chat

Qwen3 30B A3B

New

Chat

Qwen3 30B A3B Base

New

Chat

Qwen3 4B

New

Chat

Qwen3 4B Base

New

Chat

Qwen3 8B

Breakthrough technical innovations

Explore all the game-changing architectural advances that make Qwen models shine.

  • Mixture of Experts (MoE)

    Sparse expert routing activates only 37B out of 671B parameters for each token in V3. Advanced load balancing without auxiliary losses maintains performance while reducing computational cost.

  • Group Relative Policy Optimization

    New RL approach that removes separate value networks in RLHF, using grouped relative advantage estimation to cut compute requirements while maintaining training stability.

  • Native Reasoning Transparency

    First reasoning model to expose complete thinking process in <think> tags. Native reasoning capabilities built into model foundation through large-scale reinforcement learning.

  • MetaP Training

    First successful implementation of FP8 mixed precision training on a 671B parameter model. Pioneering reinforcement learning approach without supervised fine-tuning as preliminary step.

  • Multi-Head Latent Attention

    Innovative attention mechanism that reduces KV-cache memory requirements while maintaining modeling performance. Optimized for efficient inference deployment.

  • Multi-Token Prediction

    Novel training objective that allows the model to predict multiple tokens simultaneously. Enhanced performance and efficiency through advanced training techniques.

Deployment options

Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.

  • Serverless

  • Inference

Serverless Inference

Real-time

A fully managed inference API that automatically scales with request volume.

Best for

Variable or unpredictable traffic

Rapid prototyping and iteration

Cost-sensitive or early-stage production workloads

Batch

Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% less cost.

Best for

Classifying large datasets

Offline summarization

Synthetic data generation

Dedicated Inference

Dedicated Model Inference

An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine.

Best for

Predictable or steady traffic

Latency-sensitive applications

High-throughput production workloads

Dedicated Container Inference

Run inference with your own engine and model on fully-managed, scalable infrastructure.

Best for

Generative media models

Non-standard runtimes

Custom inference pipelines