Kimi K2 Thinking is Moonshot AI’s best open-source thinking model.
Built as a thinking agent, it reasons step by step while using tools, achieving state-of-the-art performance on Humanity’s Last Exam (HLE), BrowseComp, and other benchmarks, with major gains in reasoning, agentic search, coding, writing, and general capabilities.
Kimi K2 Thinking can execute up to 200–300 sequential tool calls without human intervention, reasoning coherently across hundreds of steps to solve complex problems.
It marks Moonshot AI’s latest effort in test-time scaling, scaling both thinking tokens and tool-calling steps.
K2 Thinking exhibits substantial gains in coding and software development tasks. It achieves scores of 61.1% on SWE-bench Multilingual, 71.3% on SWE-bench Verified, and 47.1% on Terminal-Bench, showcasing strong generalization across programming languages and agent scaffolds.
The model delivers notable improvements on HTML, React, and component-intensive front-end tasks—translating ideas into fully functional, responsive products. In agentic coding settings, it reasons while invoking tools, integrating fluidly into software agents to execute complex, multi-step development workflows with precision and adaptability.
K2 Thinking demonstrates strong performance in agentic search and browsing scenarios. On BrowseComp—a challenging benchmark designed to evaluate models’ ability to continuously browse, search, and reason over hard-to-find real-world web information—K2 Thinking achieved a score of 60.2%, significantly outperforming the human baseline of 29.2%. This result highlights K2 Thinking’s superior capability for goal-directed, web-based reasoning and its robustness in dynamic, information-rich environments.
K2 Thinking can execute 200–300 sequential tool calls, driven by long-horizon planning and adaptive reasoning. It performs dynamic cycles of think → search → browser use → think → code, continually generating and refining hypotheses, verifying evidence, reasoning, and constructing coherent answers. This interleaved reasoning allows it to decompose ambiguous, open-ended problems into clear, actionable subtasks.
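The interleaved think → tool → think loop described above can be sketched as a minimal agent controller. This is an illustrative sketch only: the model is stubbed out with a scripted policy, and the tool set, stopping condition, and step cap are assumptions for the example, not Moonshot AI’s actual API or agent scaffold.

```python
# Minimal sketch of an interleaved think -> tool -> think agent loop.
# A real deployment would replace `scripted_model` with calls to the
# Kimi K2 Thinking API; here it is a scripted stand-in so the loop runs.

MAX_STEPS = 300  # cap matching the 200-300 sequential tool calls cited above

def scripted_model(history):
    """Stand-in for the model: pick the next action given the history."""
    if not any(step[0] == "search" for step in history):
        return ("search", "population of Reykjavik")   # gather evidence
    if not any(step[0] == "code" for step in history):
        return ("code", "139000 + 0")                  # verify the figure
    return ("answer", history[-1][1])                  # enough evidence

TOOLS = {
    "search": lambda query: "139000",        # fake search result for the demo
    "code": lambda src: str(eval(src)),      # toy code-execution tool
}

def run_agent(model):
    history = []                             # list of (action, observation)
    for _ in range(MAX_STEPS):
        action, arg = model(history)
        if action == "answer":
            return arg, len(history)         # final answer, tool calls used
        observation = TOOLS[action](arg)     # execute tool, feed result back
        history.append((action, observation))
    return None, len(history)                # step budget exhausted

answer, calls = run_agent(scripted_model)
```

The key design point the loop illustrates is that each tool observation is appended to the shared history before the next reasoning step, so hypotheses can be refined and verified across many sequential calls.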
Creative Writing: K2 Thinking delivers improvements in completeness and richness. It shows stronger command of style and instruction, handling diverse tones and formats with natural fluency. Its writing becomes more vivid and imaginative—poetic imagery carries deeper associations, while stories and scripts feel more human, emotional, and purposeful. The ideas it expresses often reach greater thematic depth and resonance.
Practical Writing: K2 Thinking demonstrates marked gains in reasoning depth, perspective breadth, and instruction adherence. It follows prompts with higher precision, addressing each requirement clearly and systematically—often expanding on every mentioned point to ensure thorough coverage. In academic, research, and long-form analytical writing, it excels at producing rigorous, logically coherent, and substantively rich content, making it particularly effective in scholarly and professional contexts.
Personal & Emotional: When addressing personal or emotional questions, K2 Thinking responds with more empathy and balance. Its reflections are thoughtful and specific, offering nuanced perspectives and actionable next steps. It helps users navigate complex decisions with clarity and care—grounded, practical, and genuinely human in tone.
Kimi K2 Thinking sets new records across benchmarks that assess reasoning, coding, and agent capabilities. K2 Thinking achieves 44.9% on HLE with tools, 60.2% on BrowseComp, and 71.3% on SWE-Bench Verified, demonstrating strong generalization as a state-of-the-art thinking agent model.
| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
|---|---|---|---|---|---|---|---|
| **Reasoning Tasks** | | | | | | | |
| Humanity’s Last Exam (Text-only) | no tools | 23.9 | 26.3 | 19.8 | 7.9 | 19.8 | 25.4 |
| | w/ tools | 44.9 | 41.7 | 32.0 | 21.7 | 20.3 | 41.0 |
| | heavy | 51.0 | 42.0 | — | — | — | 50.7 |
| AIME 2025 | no tools | 94.5 | 94.6 | 87.0 | 51.0 | 89.3 | 91.7 |
| | w/ python | 99.1 | 99.6 | 100.0 | 75.2 | 58.1 | 98.8 |
| | heavy | 100.0 | 100.0 | — | — | — | 100.0 |
| HMMT 2025 | no tools | 89.4 | 93.3 | 74.6 | 38.8 | 83.6 | 90.0 |
| | w/ python | 95.1 | 96.7 | 88.8 | 70.4 | 49.5 | 93.9 |
| | heavy | 97.5 | 100.0 | — | — | — | 96.7 |
| IMO-AnswerBench | no tools | 78.6 | 76.0 | 65.9 | 45.8 | 76.0 | 73.1 |
| GPQA-Diamond | no tools | 84.5 | 85.7 | 83.4 | 74.2 | 79.9 | 87.5 |
| **General Tasks** | | | | | | | |
| MMLU-Pro | no tools | 84.6 | 87.1 | 87.5 | 81.9 | 85.0 | — |
| MMLU-Redux | no tools | 94.4 | 95.3 | 95.6 | 92.7 | 93.7 | — |
| Longform Writing | no tools | 73.8 | 71.4 | 79.8 | 62.8 | 72.5 | — |
| HealthBench | no tools | 58.0 | 67.2 | 44.2 | 43.8 | 46.9 | — |
| **Agentic Search Tasks** | | | | | | | |
| BrowseComp | w/ tools | 60.2 | 54.9 | 24.1 | 7.4 | 40.1 | — |
| BrowseComp-ZH | w/ tools | 62.3 | 63.0 | 42.4 | 22.2 | 47.9 | — |
| Seal-0 | w/ tools | 56.3 | 51.4 | 53.4 | 25.2 | 38.5 | — |
| FinSearchComp-T3 | w/ tools | 47.4 | 48.5 | 44.0 | 10.4 | 27.0 | — |
| Frames | w/ tools | 87.0 | 86.0 | 85.0 | 58.1 | 80.2 | — |
| **Coding Tasks** | | | | | | | |
| SWE-bench Verified | w/ tools | 71.3 | 74.9 | 77.2 | 69.2 | 67.8 | — |
| SWE-bench Multilingual | w/ tools | 61.1 | 55.3 | 68.0 | 55.9 | 57.9 | — |
| Multi-SWE-bench | w/ tools | 41.9 | 39.3 | 44.3 | 33.5 | 30.6 | — |
| SciCode | no tools | 44.8 | 42.9 | 44.7 | 30.7 | 37.7 | — |
| LiveCodeBench v6 | no tools | 83.1 | 87.0 | 64.0 | 56.1 | 74.1 | — |
| OJ-Bench (C++) | no tools | 48.7 | 56.2 | 30.4 | 25.5 | 38.2 | — |
| Terminal-Bench | w/ simulated tools (JSON) | 47.1 | 43.8 | 51.0 | 44.5 | 37.7 | — |