Kimi K2 Thinking is Moonshot AI’s best open-source thinking model.

Built as a thinking agent, it reasons step by step while using tools, achieving state-of-the-art performance on Humanity’s Last Exam (HLE), BrowseComp, and other benchmarks, with major gains in reasoning, agentic search, coding, writing, and general capabilities.

Kimi K2 Thinking can execute up to 200–300 sequential tool calls without human intervention, reasoning coherently across hundreds of steps to solve complex problems.

It marks Moonshot AI’s latest effort in test-time scaling, scaling up both thinking tokens and tool-calling steps.
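As a concrete starting point, here is a minimal sketch of a single chat turn that surfaces the reasoning trace separately from the final answer. It assumes the Python `ollama` client, a `kimi-k2-thinking` model tag, and the client's `think` option; all three are assumptions of this sketch rather than details from this model card.

```python
# Minimal sketch: one chat turn with the thinking trace surfaced separately.
# Assumes the `ollama` Python client and a "kimi-k2-thinking" model tag.
import ollama

response = ollama.chat(
    model="kimi-k2-thinking",  # assumed tag; substitute whatever tag you pull
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    think=True,                # ask the client to return the reasoning trace
)

print("thinking:", response.message.thinking)  # step-by-step reasoning tokens
print("answer:  ", response.message.content)   # the final user-facing answer
```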

Agentic Coding

K2 Thinking exhibits substantial gains in coding and software development tasks. It scores 61.1% on SWE-bench Multilingual, 71.3% on SWE-bench Verified, and 47.1% on Terminal-Bench, showcasing strong generalization across programming languages and agent scaffolds.

The model delivers notable improvements on HTML, React, and component-intensive front-end tasks—translating ideas into fully functional, responsive products. In agentic coding settings, it reasons while invoking tools, integrating fluidly into software agents to execute complex, multi-step development workflows with precision and adaptability.
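In such scaffolds the model only sees tool declarations; the agent executes whatever the model calls. A hypothetical shell tool for a coding agent might be wired up as below (the `run_shell` name and schema are illustrative placeholders, not an official scaffold):

```python
# Hypothetical tool declaration for a coding agent (names are illustrative).
# The model decides when to call the tool; the scaffold only executes it.
import subprocess

run_shell_tool = {
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the project workspace "
                       "and return its combined output.",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "Command to run"},
            },
            "required": ["command"],
        },
    },
}

def run_shell(command: str) -> str:
    """Execute the command and return stdout+stderr for the model to read."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr
```

Passing this schema in the `tools` list of a chat request lets the model request command runs; the loop pattern for feeding results back is sketched in the next section.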

Agentic Search and Browsing

K2 Thinking demonstrates strong performance in agentic search and browsing scenarios. On BrowseComp, a challenging benchmark designed to evaluate models’ ability to continuously browse, search, and reason over hard-to-find real-world web information, K2 Thinking achieved a score of 60.2%, significantly outperforming the human baseline of 29.2%. This result highlights K2 Thinking’s superior capability for goal-directed, web-based reasoning and its robustness in dynamic, information-rich environments.

K2 Thinking can execute 200–300 sequential tool calls, driven by long-horizon planning and adaptive reasoning. It performs dynamic cycles of think → search → browser use → think → code, continually generating and refining hypotheses, verifying evidence, reasoning, and constructing coherent answers. This interleaved reasoning allows it to decompose ambiguous, open-ended problems into clear, actionable subtasks.
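A minimal sketch of that interleaved loop, assuming the same Python `ollama` client as above; the `web_search` and `browse` helpers, their schemas, the model tag, and the tool-result message fields are all placeholders for whatever scaffold actually hosts the model:

```python
# Sketch of the think -> search -> browse -> think cycle as a tool loop.
# web_search/browse are hypothetical stubs, not a published API, and the
# message field names are client-version assumptions.
import ollama

def web_search(query: str) -> str:
    return "placeholder search results for: " + query  # stub implementation

def browse(url: str) -> str:
    return "placeholder page text from: " + url        # stub implementation

TOOLS = {"web_search": web_search, "browse": browse}

tool_schemas = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web and return result snippets.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "browse",
        "description": "Fetch a web page and return its text content.",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]}}},
]

messages = [{"role": "user", "content": "Research question goes here."}]

for _ in range(300):  # the card reports coherent runs of 200-300 tool calls
    response = ollama.chat(model="kimi-k2-thinking", messages=messages,
                           tools=tool_schemas, think=True)
    messages.append(response.message)
    if not response.message.tool_calls:  # no tool call: final answer reached
        break
    for call in response.message.tool_calls:
        result = TOOLS[call.function.name](**call.function.arguments)
        messages.append({"role": "tool", "content": result,
                         "tool_name": call.function.name})

print(response.message.content)
```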

General Capabilities

Creative Writing: K2 Thinking delivers improvements in completeness and richness. It shows stronger command of style and instruction, handling diverse tones and formats with natural fluency. Its writing becomes more vivid and imaginative: poetic imagery carries deeper associations, while stories and scripts feel more human, emotional, and purposeful. The ideas it expresses often reach greater thematic depth and resonance.

Practical Writing: K2 Thinking demonstrates marked gains in reasoning depth, perspective breadth, and instruction adherence. It follows prompts with higher precision, addressing each requirement clearly and systematically, often expanding on every mentioned point to ensure thorough coverage. In academic, research, and long-form analytical writing, it excels at producing rigorous, logically coherent, and substantively rich content, making it particularly effective in scholarly and professional contexts.

Personal & Emotional: When addressing personal or emotional questions, K2 Thinking responds with more empathy and balance. Its reflections are thoughtful and specific, offering nuanced perspectives and actionable next steps. It helps users navigate complex decisions with clarity and care, staying grounded, practical, and genuinely human in tone.

Benchmarks

Kimi K2 Thinking sets new records across benchmarks that assess reasoning, coding, and agent capabilities. K2 Thinking achieves 44.9% on HLE with tools, 60.2% on BrowseComp, and 71.3% on SWE-Bench Verified, demonstrating strong generalization as a state-of-the-art thinking agent model.

| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
|---|---|---|---|---|---|---|---|
| **Reasoning Tasks** | | | | | | | |
| Humanity’s Last Exam (text-only) | no tools | 23.9 | 26.3 | 19.8 | 7.9 | 19.8 | 25.4 |
| | w/ tools | 44.9 | 41.7 | 32.0 | 21.7 | 20.3 | 41.0 |
| | heavy | 51.0 | 42.0 | - | - | - | 50.7 |
| AIME 2025 | no tools | 94.5 | 94.6 | 87.0 | 51.0 | 89.3 | 91.7 |
| | w/ python | 99.1 | 99.6 | 100.0 | 75.2 | 58.1 | 98.8 |
| | heavy | 100.0 | 100.0 | - | - | - | 100.0 |
| HMMT 2025 | no tools | 89.4 | 93.3 | 74.6 | 38.8 | 83.6 | 90.0 |
| | w/ python | 95.1 | 96.7 | 88.8 | 70.4 | 49.5 | 93.9 |
| | heavy | 97.5 | 100.0 | - | - | - | 96.7 |
| IMO-AnswerBench | no tools | 78.6 | 76.0 | 65.9 | 45.8 | 76.0 | 73.1 |
| GPQA-Diamond | no tools | 84.5 | 85.7 | 83.4 | 74.2 | 79.9 | 87.5 |
| **General Tasks** | | | | | | | |
| MMLU-Pro | no tools | 84.6 | 87.1 | 87.5 | 81.9 | 85.0 | - |
| MMLU-Redux | no tools | 94.4 | 95.3 | 95.6 | 92.7 | 93.7 | - |
| Longform Writing | no tools | 73.8 | 71.4 | 79.8 | 62.8 | 72.5 | - |
| HealthBench | no tools | 58.0 | 67.2 | 44.2 | 43.8 | 46.9 | - |
| **Agentic Search Tasks** | | | | | | | |
| BrowseComp | w/ tools | 60.2 | 54.9 | 24.1 | 7.4 | 40.1 | - |
| BrowseComp-ZH | w/ tools | 62.3 | 63.0 | 42.4 | 22.2 | 47.9 | - |
| Seal-0 | w/ tools | 56.3 | 51.4 | 53.4 | 25.2 | 38.5 | - |
| FinSearchComp-T3 | w/ tools | 47.4 | 48.5 | 44.0 | 10.4 | 27.0 | - |
| Frames | w/ tools | 87.0 | 86.0 | 85.0 | 58.1 | 80.2 | - |
| **Coding Tasks** | | | | | | | |
| SWE-bench Verified | w/ tools | 71.3 | 74.9 | 77.2 | 69.2 | 67.8 | - |
| SWE-bench Multilingual | w/ tools | 61.1 | 55.3 | 68.0 | 55.9 | 57.9 | - |
| Multi-SWE-bench | w/ tools | 41.9 | 39.3 | 44.3 | 33.5 | 30.6 | - |
| SciCode | no tools | 44.8 | 42.9 | 44.7 | 30.7 | 37.7 | - |
| LiveCodeBench v6 | no tools | 83.1 | 87.0 | 64.0 | 56.1 | 74.1 | - |
| OJ-Bench (cpp) | no tools | 48.7 | 56.2 | 30.4 | 25.5 | 38.2 | - |
| Terminal-Bench | w/ simulated tools (JSON) | 47.1 | 43.8 | 51.0 | 44.5 | 37.7 | - |