```shell
ollama run reaperdoesntrun/Qwen3-0.6B-Distilled
ollama launch claude --model reaperdoesntrun/Qwen3-0.6B-Distilled
ollama launch codex --model reaperdoesntrun/Qwen3-0.6B-Distilled
ollama launch opencode --model reaperdoesntrun/Qwen3-0.6B-Distilled
ollama launch openclaw --model reaperdoesntrun/Qwen3-0.6B-Distilled
```
A 0.6B parameter model built in two stages: knowledge distillation from a 30B Thinking teacher to establish a structured reasoning backbone, then supervised fine-tuning on legal instruction data. 50x compression. Under 500MB quantized. Runs on a phone.
The training order is the thesis: teach the model how to reason first (distillation from Thinking teacher), then teach it what to reason about (legal SFT). The Thinking teacher’s extended deliberation traces transfer deeper reasoning structure than an Instruct teacher — critical when the student has only 0.6B parameters to work with.
“Structure beats scale, collaboration beats hierarchy, observation beats theory.” — Convergent Intelligence LLC: Research Division
Qwen3-0.6B distilled from Qwen3-30B-A3B-Thinking-2507 — a Mixture-of-Experts model with 30B total parameters, ~3B active per token, using the Thinking variant that generates extended internal reasoning traces.
Why the Thinking teacher matters at 0.6B: The Thinking variant produces higher-entropy softmax distributions than the Instruct variant — it considers more reasoning paths before committing. At distillation temperature T=2.0, the 0.6B student sees a richer landscape of alternative derivation strategies. With only 0.6B parameters, every bit of transferred structure counts. The Thinking teacher gives more.
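The entropy claim can be illustrated directly: raising the softmax temperature flattens the distribution over next tokens, which is what exposes alternative derivation paths to the student. A minimal sketch (the logits here are made up for illustration, not taken from either model):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Toy teacher logits over four candidate tokens (illustrative values only)
logits = [4.0, 2.0, 1.0, 0.5]
p_sharp = softmax(logits, T=1.0)   # what an Instruct-style teacher might commit to
p_soft = softmax(logits, T=2.0)    # the T=2.0 target the 0.6B student distills from
# entropy(p_soft) > entropy(p_sharp): the softened target carries more
# information about secondary reasoning paths per token
```

The same mechanism is why distillation losses are typically computed on temperature-softened distributions rather than hard labels.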
Data: 6,122 STEM chain-of-thought samples across 12 domains:
| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Advanced Calculus | 268 |
| Modern Physics | 177 |
| Physiology | 114 |
| Molecular Biology | 71 |
All datasets sourced from 0xZee. Shuffled with seed 42 and split 95/5 train/eval.
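The split arithmetic is consistent with the Stage 1 table below: 95% of 6,122 samples is 5,815, the reported training-sample count. A minimal sketch (integer IDs stand in for the actual samples, and Python's `random.Random(42)` is an assumption about how the seeded shuffle was implemented):

```python
import random

# Stand-in for the 6,122 STEM CoT samples
samples = list(range(6122))

# Seeded shuffle (seed 42, as stated), then a 95/5 train/eval split
random.Random(42).shuffle(samples)
cut = int(len(samples) * 0.95)
train, eval_set = samples[:cut], samples[cut:]
# len(train) == 5815, matching "Training samples" in the Stage 1 table
# len(eval_set) == 307
```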
Loss function:
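The loss itself is not spelled out here. A standard knowledge-distillation objective consistent with the listed hyperparameters (temperature T = 2.0, proof-token weight annealed from 2.5 to 1.5) would be, as an assumption:

```latex
\mathcal{L} \;=\; \alpha \, T^{2} \, \mathrm{KL}\!\left( p_{\mathrm{teacher}}^{(T)} \,\middle\|\, p_{\mathrm{student}}^{(T)} \right) \;+\; (1-\alpha)\, \mathcal{L}_{\mathrm{CE}}
```

where the superscript (T) denotes temperature-softened distributions, proof-span tokens are up-weighted by the annealed factor w (2.5 decaying to 1.5), and the mixing coefficient α is hypothetical, as the card does not state it.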
Training format:
```
Solve the following problem carefully and show a rigorous derivation.
Problem:
{question}
Proof:
{CoT}
Final Answer:
{response}
```
Stage 1 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 |
| Effective batch size | 8 |
| Learning rate | 1.5e-5 → 1e-6 (cosine) |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
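The learning-rate entry in the table can be sketched as a cosine decay from the peak to the floor. A minimal sketch, assuming no warmup (the card does not state one) and that the decay runs over the full training horizon:

```python
import math

def cosine_lr(step, total_steps, lr_max=1.5e-5, lr_min=1e-6):
    """Cosine decay from lr_max (step 0) down to lr_min (final step)."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# cosine_lr(0, 727) == 1.5e-5 (peak); cosine_lr(727, 727) == 1e-6 (floor)
# 727 optimizer steps is itself an estimate: 5,815 samples / batch size 8
```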
The distilled model was fine-tuned on Alignment-Lab-AI/Lawyer-Instruct using TRL’s SFTTrainer.
Why legal on top of STEM: Legal reasoning is structurally isomorphic to mathematical reasoning — premise identification, logical chaining, exception handling, structured argumentation toward a conclusion. A model that learned rigorous derivation transfers that structure to legal analysis rather than learning legal templates from scratch.
Training format:
```
### Instruction:
{instruction}
### Response:
{output}
```
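Both templates can be expressed as small formatting helpers, which is how you would prepare prompts for either mode at inference time. A sketch; the exact whitespace and newline placement in the original training data is assumed:

```python
def format_stage1(question: str, cot: str = "", response: str = "") -> str:
    """Stage 1 (STEM derivation) template from the distillation phase."""
    return (
        "Solve the following problem carefully and show a rigorous derivation.\n"
        f"Problem:\n{question}\n"
        f"Proof:\n{cot}\n"
        f"Final Answer:\n{response}"
    )

def format_stage2(instruction: str, output: str = "") -> str:
    """Stage 2 (Alpaca-style) template used for legal SFT."""
    return f"### Instruction:\n{instruction}\n### Response:\n{output}"
```

At inference, leave `cot`/`output` empty so the model generates the derivation or response itself.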
Stage 2 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 5e-6 (lower than Stage 1 to preserve backbone) |
| Gradient checkpointing | Enabled |
| Precision | bf16 |
| Attribute | Value |
|---|---|
| Architecture | Qwen3 (causal LM, RoPE, GQA) |
| Parameters | 0.6B |
| Base model | Qwen/Qwen3-0.6B |
| Teacher model | Qwen/Qwen3-30B-A3B-Thinking-2507 |
| Compression ratio | 50x |
| Stage 1 data | 6,122 STEM CoT samples (12 datasets) |
| Stage 2 data | Alignment-Lab-AI/Lawyer-Instruct |
| Context length | 1024 tokens (training) |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Legal instruction-following (Stage 2 format)
prompt = """### Instruction:
What is the difference between a felony and a misdemeanor?
### Response:
"""

# STEM derivation (Stage 1 format still works); swap this in for `prompt` below
prompt_stem = """Solve the following problem carefully and show a rigorous derivation.
Problem:
Compute the determinant of the matrix [[1, 2], [3, 4]].
Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Quantized versions at reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF.
STEM derivation (Stage 1):

```
Solve the following problem carefully and show a rigorous derivation.
Problem:
[Your problem]
Proof:
```

Instruction-following (Stage 2):

```
### Instruction:
[Your question]
### Response:
```
Good for: Ultra-lightweight reasoning on mobile/edge/IoT, legal and STEM instruction-following, educational tutoring, embedded inference, component in multi-model pipelines, anywhere you need reasoning in under 500MB.
Not for: Formal proof verification, actual legal counsel, safety-critical analysis, complex multi-step proofs (>8 steps), or long-context tasks beyond 1024 tokens.
0.6B is a hard capacity constraint. The model trades depth for deployability. It will make reasoning errors that a larger model would not. Multi-step derivations beyond ~8 steps degrade. Legal reasoning covers general concepts but lacks the nuance of larger models. Performance is weakest on underrepresented domains (molecular biology, physiology). Always verify outputs.
| Model | Description |
|---|---|
| Qwen3-0.6B-STEM-Proof-Distilled-Thinking | Stage 1 only — pure STEM backbone |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF | This model quantized for edge deployment |
| Qwen3-1.7B-STEM-Proof-Distilled | Larger 1.7B variant (Instruct teacher) |
| Qwen3-1.7B-Distilled-30B-A3B-SFT | Larger 1.7B variant + legal SFT |
```bibtex
@misc{colca2026thinking06bsft,
  title={Two-Stage Reasoning Transfer at 0.6B: Thinking Teacher Distillation + Legal SFT},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT},
  note={Convergent Intelligence LLC: Research Division}
}
```
Convergent Intelligence LLC: Research Division
“Where classical analysis fails to see, we begin.”