Kimi K2 Thinking is Moonshot AI’s best open-source thinking model.
Built as a thinking agent, it reasons step by step while using tools, achieving state-of-the-art performance on Humanity’s Last Exam (HLE), BrowseComp, and other benchmarks, with major gains in reasoning, agentic search, coding, writing, and general capabilities.
Kimi K2 Thinking can execute up to 200–300 sequential tool calls without human intervention, reasoning coherently across hundreds of steps to solve complex problems.
It marks Moonshot AI’s latest effort in test-time scaling, scaling both thinking tokens and tool-calling steps.
K2 Thinking exhibits substantial gains in coding and software development tasks. It achieves scores of 61.1% on SWE-bench Multilingual, 71.3% on SWE-bench Verified, and 47.1% on Terminal-Bench, showcasing strong generalization across programming languages and agent scaffolds.
The model delivers notable improvements on HTML, React, and component-intensive front-end tasks—translating ideas into fully functional, responsive products. In agentic coding settings, it reasons while invoking tools, integrating fluidly into software agents to execute complex, multi-step development workflows with precision and adaptability.
K2 Thinking demonstrates strong performance in agentic search and browsing scenarios. On BrowseComp—a challenging benchmark designed to evaluate models’ ability to continuously browse, search, and reason over hard-to-find real-world web information—K2 Thinking achieved a score of 60.2%, significantly outperforming the human baseline of 29.2%. This result highlights K2 Thinking’s superior capability for goal-directed, web-based reasoning and its robustness in dynamic, information-rich environments.
K2 Thinking can execute 200–300 sequential tool calls, driven by long-horizon planning and adaptive reasoning. It performs dynamic cycles of think → search → browser use → think → code, continually generating and refining hypotheses, verifying evidence, reasoning, and constructing coherent answers. This interleaved reasoning allows it to decompose ambiguous, open-ended problems into clear, actionable subtasks.
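The interleaved think → tool → think loop described above can be sketched as a minimal agent controller. This is an illustrative sketch only: the model is stubbed out with a scripted policy, and the tool set, stopping condition, and step cap are assumptions for the example, not Moonshot AI’s actual API or agent scaffold.

```python
# Minimal sketch of an interleaved think -> tool -> think agent loop.
# A real deployment would replace `scripted_model` with calls to the
# Kimi K2 Thinking API; here it is a scripted stand-in so the loop runs.

MAX_STEPS = 300  # cap matching the 200-300 sequential tool calls cited above

def scripted_model(history):
    """Stand-in for the model: pick the next action given the history."""
    if not any(step[0] == "search" for step in history):
        return ("search", "population of Reykjavik")   # gather evidence
    if not any(step[0] == "code" for step in history):
        return ("code", "139000 + 0")                  # verify the figure
    return ("answer", history[-1][1])                  # enough evidence

TOOLS = {
    "search": lambda query: "139000",        # fake search result for the demo
    "code": lambda src: str(eval(src)),      # toy code-execution tool
}

def run_agent(model):
    history = []                             # list of (action, observation)
    for _ in range(MAX_STEPS):
        action, arg = model(history)
        if action == "answer":
            return arg, len(history)         # final answer, tool calls used
        observation = TOOLS[action](arg)     # execute tool, feed result back
        history.append((action, observation))
    return None, len(history)                # step budget exhausted

answer, calls = run_agent(scripted_model)
```

The key design point the loop illustrates is that each tool observation is appended to the shared history before the next reasoning step, so hypotheses can be refined and verified across many sequential calls.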
Creative Writing: K2 Thinking delivers improvements in completeness and richness. It shows stronger command of style and instruction, handling diverse tones and formats with natural fluency. Its writing becomes more vivid and imaginative—poetic imagery carries deeper associations, while stories and scripts feel more human, emotional, and purposeful. The ideas it expresses often reach greater thematic depth and resonance.
Practical Writing: K2 Thinking demonstrates marked gains in reasoning depth, perspective breadth, and instruction adherence. It follows prompts with higher precision, addressing each requirement clearly and systematically—often expanding on every mentioned point to ensure thorough coverage. In academic, research, and long-form analytical writing, it excels at producing rigorous, logically coherent, and substantively rich content, making it particularly effective in scholarly and professional contexts.
Personal & Emotional: When addressing personal or emotional questions, K2 Thinking responds with more empathy and balance. Its reflections are thoughtful and specific, offering nuanced perspectives and actionable next steps. It helps users navigate complex decisions with clarity and care—grounded, practical, and genuinely human in tone.
Kimi K2 Thinking sets new records across benchmarks that assess reasoning, coding, and agent capabilities. K2 Thinking achieves 44.9% on HLE with tools, 60.2% on BrowseComp, and 71.3% on SWE-Bench Verified, demonstrating strong generalization as a state-of-the-art thinking agent model.
| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
|---|---|---|---|---|---|---|---|
| **Reasoning Tasks** | | | | | | | |
| Humanity’s Last Exam (Text-only) | no tools | 23.9 | 26.3 | 19.8 | 7.9 | 19.8 | 25.4 |
| | w/ tools | 44.9 | 41.7 | 32.0 | 21.7 | 20.3 | 41.0 |
| | heavy | 51.0 | 42.0 | — | — | — | 50.7 |
| AIME 2025 | no tools | 94.5 | 94.6 | 87.0 | 51.0 | 89.3 | 91.7 |
| | w/ python | 99.1 | 99.6 | 100.0 | 75.2 | 58.1 | 98.8 |
| | heavy | 100.0 | 100.0 | — | — | — | 100.0 |
| HMMT 2025 | no tools | 89.4 | 93.3 | 74.6 | 38.8 | 83.6 | 90.0 |
| | w/ python | 95.1 | 96.7 | 88.8 | 70.4 | 49.5 | 93.9 |
| | heavy | 97.5 | 100.0 | — | — | — | 96.7 |
| IMO-AnswerBench | no tools | 78.6 | 76.0 | 65.9 | 45.8 | 76.0 | 73.1 |
| GPQA-Diamond | no tools | 84.5 | 85.7 | 83.4 | 74.2 | 79.9 | 87.5 |
| **General Tasks** | | | | | | | |
| MMLU-Pro | no tools | 84.6 | 87.1 | 87.5 | 81.9 | 85.0 | — |
| MMLU-Redux | no tools | 94.4 | 95.3 | 95.6 | 92.7 | 93.7 | — |
| Longform Writing | no tools | 73.8 | 71.4 | 79.8 | 62.8 | 72.5 | — |
| HealthBench | no tools | 58.0 | 67.2 | 44.2 | 43.8 | 46.9 | — |
| **Agentic Search Tasks** | | | | | | | |
| BrowseComp | w/ tools | 60.2 | 54.9 | 24.1 | 7.4 | 40.1 | — |
| BrowseComp-ZH | w/ tools | 62.3 | 63.0 | 42.4 | 22.2 | 47.9 | — |
| Seal-0 | w/ tools | 56.3 | 51.4 | 53.4 | 25.2 | 38.5 | — |
| FinSearchComp-T3 | w/ tools | 47.4 | 48.5 | 44.0 | 10.4 | 27.0 | — |
| Frames | w/ tools | 87.0 | 86.0 | 85.0 | 58.1 | 80.2 | — |
| **Coding Tasks** | | | | | | | |
| SWE-bench Verified | w/ tools | 71.3 | 74.9 | 77.2 | 69.2 | 67.8 | — |
| SWE-bench Multilingual | w/ tools | 61.1 | 55.3 | 68.0 | 55.9 | 57.9 | — |
| Multi-SWE-bench | w/ tools | 41.9 | 39.3 | 44.3 | 33.5 | 30.6 | — |
| SciCode | no tools | 44.8 | 42.9 | 44.7 | 30.7 | 37.7 | — |
| LiveCodeBench v6 | no tools | 83.1 | 87.0 | 64.0 | 56.1 | 74.1 | — |
| OJ-Bench (C++) | no tools | 48.7 | 56.2 | 30.4 | 25.5 | 38.2 | — |
| Terminal-Bench | w/ simulated tools (JSON) | 47.1 | 43.8 | 51.0 | 44.5 | 37.7 | — |