
A 0.6B parameter model built in two stages: knowledge distillation from a 30B Thinking teacher to establish a structured reasoning backbone, then supervised fine-tuning on legal instruction data. 50x compression. Under 500MB quantized. Runs on a phone.

Capabilities: tools, thinking
ollama run reaperdoesntrun/Qwen3-0.6B-Distilled

Details


f31e321d12f3 · 484MB · qwen3 · 752M parameters · Q4_K_M

Readme

Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT

A 0.6B parameter model built in two stages: knowledge distillation from a 30B Thinking teacher to establish a structured reasoning backbone, then supervised fine-tuning on legal instruction data. 50x compression. Under 500MB quantized. Runs on a phone.

The training order is the thesis: teach the model how to reason first (distillation from Thinking teacher), then teach it what to reason about (legal SFT). The Thinking teacher’s extended deliberation traces transfer deeper reasoning structure than an Instruct teacher — critical when the student has only 0.6B parameters to work with.

“Structure beats scale, collaboration beats hierarchy, observation beats theory.” — Convergent Intelligence LLC: Research Division

Training Pipeline

Stage 1: Knowledge Distillation (STEM Reasoning Backbone)

Qwen3-0.6B distilled from Qwen3-30B-A3B-Thinking-2507 — a Mixture-of-Experts model with 30B total parameters, ~3B active per token, using the Thinking variant that generates extended internal reasoning traces.

Why the Thinking teacher matters at 0.6B: The Thinking variant produces higher-entropy softmax distributions than the Instruct variant — it considers more reasoning paths before committing. At distillation temperature T=2.0, the 0.6B student sees a richer landscape of alternative derivation strategies. With only 0.6B parameters, every bit of transferred structure counts. The Thinking teacher gives more.
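The effect of T=2.0 softening can be seen on a toy logit vector (the numbers are illustrative, not taken from the actual teacher): dividing logits by the temperature before the softmax flattens the distribution and raises its entropy, which is exactly the "richer landscape" the student distills from.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T flattens the distribution."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

logits = [4.0, 2.0, 1.0, 0.5]   # hypothetical teacher logits for one token
sharp = softmax(logits, T=1.0)  # what a hard T=1 target looks like
soft = softmax(logits, T=2.0)   # the distillation temperature used here

# The softened distribution puts more mass on alternative tokens,
# so its entropy is strictly higher.
assert entropy(soft) > entropy(sharp)
```

At T=1 the top token dominates; at T=2 the runner-up reasoning paths keep visible probability mass, so the student is graded on the teacher's full ranking rather than a single hard label.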

Data: 6,122 STEM chain-of-thought samples across 12 domains:

| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Advanced Calculus | 268 |
| Modern Physics | 177 |
| Physiology | 114 |
| Molecular Biology | 71 |

All sourced from 0xZee. Shuffled with seed 42 and split 95/5 into train/eval (5,815 train / 307 eval).

Loss function:

  1. Proof-Weighted Cross-Entropy (55%) — 2.5x weight on derivation tokens, decaying to 1.5x. Forces the student to allocate its limited capacity to reasoning steps, not answer formatting.
  2. Knowledge Distillation KL Divergence (45%) — T=2.0, scaled by T². Transfers the Thinking teacher’s full deliberation landscape.
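A minimal single-token sketch of this combined objective, with toy probability distributions. The 55/45 mix, T=2.0, T² scaling, and the 2.5→1.5 proof-weight decay follow the description above; the function names, the linear decay shape, and the collapse to one token are simplifying assumptions.

```python
import math

T = 2.0           # distillation temperature
CE_WEIGHT = 0.55  # proof-weighted cross-entropy share
KD_WEIGHT = 0.45  # KL-divergence share

def proof_weight(step, total_steps, start=2.5, end=1.5):
    """Decay the derivation-token weight from 2.5x to 1.5x (linear assumed)."""
    frac = step / max(total_steps - 1, 1)
    return start + (end - start) * frac

def kl_div(p_teacher, q_student):
    """KL(p || q) over one token's probability distribution."""
    return sum(p * math.log(p / q) for p, q in zip(p_teacher, q_student) if p > 0)

def combined_loss(student_probs, student_probs_T, teacher_probs_T, target_idx, w):
    """0.55 * proof-weighted CE + 0.45 * T^2-scaled KD, for a single token.

    student_probs:   student softmax at T=1 (CE term, against the hard label)
    student_probs_T: student softmax at T=2.0 (KD term)
    teacher_probs_T: teacher softmax at T=2.0 (KD term)
    w:               proof weight for this token (2.5 -> 1.5 on derivations)
    """
    ce = -w * math.log(student_probs[target_idx])
    kd = (T ** 2) * kl_div(teacher_probs_T, student_probs_T)
    return CE_WEIGHT * ce + KD_WEIGHT * kd
```

The T² factor compensates for the 1/T² shrinkage of soft-target gradients, keeping the two terms on comparable scales as temperature changes.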

Training format:

Solve the following problem carefully and show a rigorous derivation.

Problem:
{question}

Proof:
{CoT}

Final Answer:
{response}
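The Stage 1 format above can be assembled with a small helper (the function name is illustrative, not part of the released code):

```python
def build_stage1_prompt(question: str, cot: str, response: str) -> str:
    """Assemble one Stage 1 distillation sample in the training format."""
    return (
        "Solve the following problem carefully and show a rigorous derivation.\n\n"
        f"Problem:\n{question}\n\n"
        f"Proof:\n{cot}\n\n"
        f"Final Answer:\n{response}"
    )
```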

Stage 1 hyperparameters:

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 |
| Effective batch size | 8 |
| Learning rate | 1.5e-5 → 1e-6 (cosine) |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
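The cosine schedule in the table (1.5e-5 decaying to 1e-6) can be written out explicitly; the step counts below are illustrative, and no warmup is assumed:

```python
import math

LR_MAX, LR_MIN = 1.5e-5, 1e-6

def cosine_lr(step: int, total_steps: int) -> float:
    """Cosine decay from LR_MAX at step 0 to LR_MIN at the final step."""
    progress = step / (total_steps - 1)
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1 + math.cos(math.pi * progress))
```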

Stage 2: Supervised Fine-Tuning (Legal Domain)

The distilled model was fine-tuned on Alignment-Lab-AI/Lawyer-Instruct using TRL’s SFTTrainer.

Why legal on top of STEM: Legal reasoning is structurally isomorphic to mathematical reasoning — premise identification, logical chaining, exception handling, structured argumentation toward a conclusion. A model that learned rigorous derivation transfers that structure to legal analysis rather than learning legal templates from scratch.

Training format:

### Instruction:
{instruction}

### Response:
{output}
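For inference against this Stage 2 format, the prompt stops right after the response header so the model completes the answer (helper name is illustrative):

```python
def build_instruction_prompt(instruction: str, output: str = "") -> str:
    """Alpaca-style Stage 2 prompt; leave output empty at inference time."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
```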

Stage 2 hyperparameters:

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 5e-6 (lower than Stage 1 to preserve backbone) |
| Gradient checkpointing | Enabled |
| Precision | bf16 |

Model Details

| Attribute | Value |
|---|---|
| Architecture | Qwen3 (causal LM, RoPE, GQA) |
| Parameters | 0.6B |
| Base model | Qwen/Qwen3-0.6B |
| Teacher model | Qwen/Qwen3-30B-A3B-Thinking-2507 |
| Compression ratio | 50x |
| Stage 1 data | 6,122 STEM CoT samples (12 datasets) |
| Stage 2 data | Alignment-Lab-AI/Lawyer-Instruct |
| Context length | 1024 tokens (training) |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Legal instruction-following
prompt = """### Instruction:
What is the difference between a felony and a misdemeanor?

### Response:
"""

# STEM derivation (Stage 1 format still works)
prompt_stem = """Solve the following problem carefully and show a rigorous derivation.

Problem:
Compute the determinant of the matrix [[1, 2], [3, 4]].

Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

GGUF

Quantized versions at reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF.

Prompt Formats

STEM derivation (Stage 1):

Solve the following problem carefully and show a rigorous derivation.

Problem:
[Your problem]

Proof:

Instruction-following (Stage 2):

### Instruction:
[Your question]

### Response:

Intended Uses

Good for: Ultra-lightweight reasoning on mobile/edge/IoT, legal and STEM instruction-following, educational tutoring, embedded inference, component in multi-model pipelines, anywhere you need reasoning in under 500MB.

Not for: Formal proof verification, actual legal counsel, safety-critical analysis, complex multi-step proofs (>8 steps), or long-context tasks beyond 1024 tokens.

Limitations

0.6B is a hard capacity constraint. The model trades depth for deployability. It will make reasoning errors that a larger model would not. Multi-step derivations beyond ~8 steps degrade. Legal reasoning covers general concepts but lacks the nuance of larger models. Performance is weakest on underrepresented domains (molecular biology, physiology). Always verify outputs.

Related Models

| Model | Description |
|---|---|
| Qwen3-0.6B-STEM-Proof-Distilled-Thinking | Stage 1 only — pure STEM backbone |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF | This model quantized for edge deployment |
| Qwen3-1.7B-STEM-Proof-Distilled | Larger 1.7B variant (Instruct teacher) |
| Qwen3-1.7B-Distilled-30B-A3B-SFT | Larger 1.7B variant + legal SFT |

Citation

@misc{colca2026thinking06bsft,
  title={Two-Stage Reasoning Transfer at 0.6B: Thinking Teacher Distillation + Legal SFT},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT},
  note={Convergent Intelligence LLC: Research Division}
}

Convergent Intelligence LLC: Research Division “Where classical analysis fails to see, we begin.”