
3B model that shouldn't be this good - crushes benchmarks through deep chain-of-thought reasoning

ollama run fauxpaslife/nanbeige4.1


Nanbeige 4.1 - 3B (Q8_0)

Original model by Nanbeige | GGUF conversion by tantk

Note: This is a very verbose model. I am impressed by its chain-of-thought quality given its size and speed.

What makes this special


First 3B model to nail BOTH reasoning AND agentic tool use. Most small models pick one lane - this crushes both.

Built with SFT + RL on top of Nanbeige4-3B-Base. It emits its internal chain-of-thought reasoning inside <think> blocks, and those blocks can get long.
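Because the reasoning arrives inside <think> tags, downstream code usually wants to separate it from the final answer. A minimal sketch, assuming the <think>…</think> convention described above (the function name is my own, not part of the model's API):

```python
import re

# Matches one <think>...</think> reasoning block, including newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw completion.

    Everything inside <think>...</think> is treated as chain-of-thought;
    whatever remains is the user-facing answer.
    """
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer

raw = "<think>2+2 is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
```

Keeping the reasoning around (rather than discarding it) is handy for debugging routing or tool-use decisions.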

Benchmark highlights

Punches 10x above its weight:

  • Deep Search: 69.9 (Qwen3-32B: 31.6) 🤯
  • Arena-Hard-v2: 73.2 (beats Qwen3-32B's 56.0)
  • Code: 76.9 LiveCodeBench-V6
  • Math: 87.4 AIME 2026, 53.4 IMO-Answer-Bench
  • Science: 83.8 GPQA, 12.6 HLE
  • Tool Use: 56.5 BFCL-V4, supports 500+ round tool chains

Best for

  • Multi-step reasoning tasks
  • Complex routing decisions (medical/emotional/activity)
  • RAG with deep semantic search
  • Agentic workflows with tool calling
  • Fast local inference with GPT-4 class reasoning depth
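
For the agentic use case, a completion typically comes back with a structured tool call that your harness must dispatch. A minimal dispatcher sketch, assuming the common {"name": ..., "arguments": {...}} tool-call shape used by Ollama-style chat APIs (the weather tool is a made-up example, not something this model ships with):

```python
import json

# Hypothetical local tools the model is allowed to call.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub result for illustration

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> str:
    """Run one model-issued tool call and return the result string.

    Expects {"name": ..., "arguments": {...}}, the shape chat APIs
    hand back when the model decides to use a tool.
    """
    fn = TOOLS[tool_call["name"]]  # KeyError here means an unknown tool
    return fn(**tool_call["arguments"])

# Simulated model output containing one tool call.
raw = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
result = dispatch(json.loads(raw))
```

In a real agent loop you would feed `result` back to the model as a tool message and let it continue reasoning.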

Recommended settings

Note that ollama run does not accept sampling flags on the command line; set the parameters interactively after starting the model (or in a Modelfile):

ollama run fauxpaslife/nanbeige4.1
>>> /set parameter temperature 0.6
>>> /set parameter top_p 0.95
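
If you want these defaults baked in, a Modelfile sketch (the custom tag `my-nanbeige` is my own; temperature 0.6 and top_p 0.95 are the values recommended above):

```
FROM fauxpaslife/nanbeige4.1
PARAMETER temperature 0.6
PARAMETER top_p 0.95
```

Then build and run it with:

ollama create my-nanbeige -f Modelfile
ollama run my-nanbeige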

Notes

  • Native deep-search capability (rare for <10B models)
  • Sustained reasoning across complex problem chains
  • Strong preference alignment (beats much larger models)
  • Max context: 131K tokens

See technical report for full details.