Kimi K2 Thinking
State-of-the-art thinking agent with deep reasoning and tool orchestration
About model
Kimi K2 Thinking is Moonshot AI's most capable open-source thinking model, built as a thinking agent that reasons step-by-step while dynamically invoking tools. Setting new state-of-the-art records on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks, K2 Thinking dramatically scales multi-step reasoning depth while maintaining stable tool use across 200–300 sequential calls, a breakthrough in long-horizon agency. Native INT4 quantization delivers a 2x inference speedup.
- 44.9% — expert-level reasoning across 100+ subjects (HLE, with tools)
- 300 — stable long-horizon agency without drift (sequential tool calls)
- 2x — inference speedup from native INT4 quantization with QAT
- Deep Thinking & Tool Orchestration: End-to-end trained to interleave chain-of-thought reasoning with function calls for autonomous workflows
- Agentic Search Excellence: 60.2% BrowseComp, 56.3% Seal-0 — superior goal-directed web reasoning in information-rich environments
- Advanced Mathematical Reasoning: 99.1% AIME 2025 (w/ python), 95.1% HMMT 2025 — elite competition-level problem solving
- Production-Ready Efficiency: Native INT4 quantization achieving lossless 2x speed improvements with 256K context window
| Model | AIME 2025 | GPQA Diamond | HLE | LiveCodeBench | MATH500 | SWE-bench Verified |
|---|---|---|---|---|---|---|
| Kimi K2 Thinking | 99.1% (w/ python) | — | 44.9% (w/ tools) | 83.1% (v6) | — | 71.3% |
API usage
Endpoint:
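Deployments of this model typically expose an OpenAI-compatible chat-completions API. A minimal sketch of a request body with a tool definition follows; the model id and the `web_search` tool are illustrative placeholders, not values taken from this page, and the actual endpoint URL is deployment-specific:

```python
import json

# Build an OpenAI-compatible chat-completions request for Kimi K2 Thinking.
# Model id and tool schema are placeholders; check your provider's docs.
request_body = {
    "model": "kimi-k2-thinking",  # placeholder model id
    "messages": [
        {"role": "system", "content": "You are a helpful research agent."},
        {"role": "user", "content": "Summarize recent HLE benchmark results."},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "web_search",  # hypothetical tool, for illustration
                "description": "Search the web and return top results.",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        }
    ],
    "max_tokens": 4096,
}

payload = json.dumps(request_body)  # body to POST to the chat-completions endpoint
```

Because the model is trained for interleaved reasoning and tool calls, responses may contain `tool_calls` entries that the client is expected to execute and feed back as `tool` messages.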
Model card
Architecture Overview:
• Mixture-of-Experts (MoE) architecture with 1T total parameters and 32B activated parameters
• 61 total layers including 1 dense layer with 384 experts selecting 8 per token
• Multi-head Latent Attention (MLA) mechanism with 7168 attention hidden dimension
• Native INT4 quantization applied to MoE components through Quantization-Aware Training (QAT)
• 256K context window enabling complex long-horizon agentic tasks
• 160K vocabulary size with SwiGLU activation function
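The gap between 1T total and 32B activated parameters follows from the sparse routing described above: only 8 of 384 experts run per token, while attention, embeddings, and the dense layer are always active. A rough back-of-envelope check, using only the figures from this card (not the exact per-layer dimensions):

```python
# Back-of-envelope MoE arithmetic (illustrative, not exact layer sizes).
total_experts = 384
active_experts = 8
expert_fraction = active_experts / total_experts  # fraction of expert weights used per token

total_params = 1_000_000_000_000   # 1T total (from the card)
active_params = 32_000_000_000     # 32B activated (from the card)
active_fraction = active_params / total_params

# The activated fraction (3.2%) exceeds the routed-expert fraction (~2.1%)
# because attention layers, embeddings, and the dense layer are always
# active regardless of routing.
print(f"routed expert fraction: {expert_fraction:.1%}")
print(f"activated param fraction: {active_fraction:.1%}")
```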
Training Methodology:
• End-to-end trained to interleave chain-of-thought reasoning with function calls
• Quantization-Aware Training (QAT) employed in post-training stage for lossless INT4 inference
• Specialized training for stable long-horizon agency across 200-300 consecutive tool invocations
• Advanced reasoning depth scaling through multi-step test-time computation
• Tool orchestration training enabling autonomous research, coding, and writing workflows
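The interleaved reason-then-act pattern described above can be sketched as a client-side loop: call the model, execute any tool call it emits, append the result, and repeat until a final answer arrives. This is a minimal sketch with stubbed model and tool functions, not the actual serving code:

```python
# Sketch of the reason -> tool -> reason loop (stubbed, no real API calls).
def model_step(messages):
    """Stub standing in for a chat-completions call: returns a tool
    call on the first turn, then a final answer once tool output exists."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "web_search", "args": {"query": "HLE benchmark"}}}
    return {"content": "Final answer based on search results."}

def run_tool(name, args):
    """Stub tool executor; a real agent would dispatch on `name`."""
    return f"results for {args['query']}"

def agent_loop(user_msg, max_steps=300):
    # K2 Thinking is trained to stay coherent across 200-300 such steps.
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = model_step(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]  # model produced a final answer
        messages.append({"role": "tool", "content": run_tool(call["name"], call["args"])})
    return None  # step budget exhausted

print(agent_loop("What is HLE?"))
```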
Performance Characteristics:
• State-of-the-art 44.9% on Humanity's Last Exam (HLE) with tools across 100+ expert subjects
• Leading agentic search performance: 60.2% BrowseComp, 62.3% BrowseComp-ZH, 56.3% Seal-0
• Elite mathematical reasoning: 99.1% AIME 2025 (w/ python), 95.1% HMMT 2025 (w/ python), 78.6% IMO-AnswerBench
• Strong coding capabilities: 71.3% SWE-Bench Verified, 61.1% SWE-Bench Multilingual, 83.1% LiveCodeBench v6
• 2x generation speed improvement through native INT4 quantization without performance degradation
• Maintains coherent goal-directed behavior surpassing prior models that degrade after 30-50 steps
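The INT4 claim above rests on weight-only quantization of the MoE components. A toy sketch of symmetric INT4 quantization with per-group scales, in the spirit of (but not identical to) the model's QAT-based scheme:

```python
# Illustrative symmetric INT4 weight quantization with per-group scales.
def quantize_int4(weights, group_size=32):
    """Map floats to integers in [-8, 7], one scale per group of weights."""
    quantized, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # avoid zero scale
        scales.append(scale)
        quantized.append([max(-8, min(7, round(w / scale))) for w in group])
    return quantized, scales

def dequantize_int4(quantized, scales):
    return [q * s for group, s in zip(quantized, scales) for q in group]

weights = [0.05 * ((-1) ** i) * (i % 11) for i in range(64)]
q, s = quantize_int4(weights)
restored = dequantize_int4(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Per-group scaling bounds the error by half a quantization step per group.
print(f"max reconstruction error: {max_err:.4f}")
```

QAT goes further than this post-hoc rounding: the model is trained with the quantizer in the loop, which is why the card can claim the 2x speedup comes without accuracy loss.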
Applications & use cases
Agentic Reasoning & Problem Solving:
• Expert-level reasoning across 100+ subjects achieving 44.9% on Humanity's Last Exam with tools
• PhD-level mathematical problem solving through 23+ interleaved reasoning and tool-call steps
• Elite competition mathematics: 99.1% AIME 2025, 95.1% HMMT 2025 with Python tools
• Dynamic hypothesis generation, evidence verification, and coherent answer construction
Agentic Search & Web Reasoning:
• State-of-the-art 60.2% BrowseComp performance, significantly outperforming 29.2% human baseline
• Continuous browsing, searching, and reasoning over hard-to-find real-world web information
• 200-300 sequential tool calls for deep research workflows without human intervention
• Goal-directed web-based reasoning with adaptive hypothesis refinement
• Financial search: 47.4% FinSearchComp-T3, 87.0% Frames benchmark
Agentic Coding & Software Development:
• Production-level coding: 71.3% SWE-Bench Verified, 61.1% SWE-Bench Multilingual, 41.9% Multi-SWE-bench
• Component-heavy frontend development: fully functional HTML, React, and responsive web applications from single prompts
• Multi-step development workflows with precision tool invocation and adaptive reasoning
• Terminal automation: 47.1% Terminal-Bench with simulated tools
• Competitive programming: 83.1% LiveCodeBench v6, 48.7% OJ-Bench (C++)
Creative & Practical Writing:
• Creative writing with vivid imagery, emotional depth, and thematic resonance
• Fiction, cultural reviews, and science fiction with natural fluency and style command
• Academic and research writing with rigorous logic, thoroughness, and substantive richness
• 73.8% Longform Writing benchmark demonstrating instruction adherence and perspective breadth
• Personal and emotional responses with empathy, nuance, and actionable guidance
Long-Horizon Autonomous Workflows:
• Research automation executing hundreds of coherent reasoning steps
• Office automation and document generation workflows
• Multi-step coding projects from ideation to functional products
• Complex problem decomposition into clear, actionable subtasks
• Stable agency surpassing models that degrade after 30-50 steps
- Model provider: Moonshot AI
- Type: Chat, Code, LLM
- Main use cases: Chat, Reasoning, Function Calling
- Features: Function Calling
- Fine tuning: Supported
- Deployment: Serverless, On-Demand Dedicated, Monthly Reserved
- Endpoint:
- Parameters: 1T
- Activated parameters: 32B
- Context length: 256K
- Input price: $1.20 / 1M tokens
- Output price: $4.00 / 1M tokens
- Input modalities: Text
- Output modalities: Text
- Released: November 4, 2025
- Last updated: November 9, 2025
- Quantization level: INT4
- External link:
- Category: Chat
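At the listed rates, per-request cost is a simple linear sum of input and output token counts. A quick sketch (the token counts in the example are hypothetical):

```python
# Cost of one request at the listed rates ($ per 1M tokens).
INPUT_PRICE = 1.20   # $ / 1M input tokens
OUTPUT_PRICE = 4.00  # $ / 1M output tokens

def request_cost(input_tokens, output_tokens):
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

# Example: a long agentic run with 200K input and 50K output tokens.
cost = request_cost(200_000, 50_000)
print(f"${cost:.2f}")  # 200K * $1.20/M + 50K * $4.00/M = $0.24 + $0.20 = $0.44
```

Note that long-horizon agentic runs re-send the growing message history on each of the 200-300 tool-call turns, so input tokens usually dominate the bill.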