
A 0.6B parameter model built in two stages: knowledge distillation from a 30B Thinking teacher to establish a structured reasoning backbone, then supervised fine-tuning on legal instruction data. 50x compression. Under 500MB quantized. Runs on a phone.

Capabilities: tools, thinking
ollama run reaperdoesntrun/Qwen3-0.6B-Distilled

Details


f31e321d12f3 · 484MB · qwen3 · 752M parameters · Q4_K_M

Readme

Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT

A 0.6B parameter model built in two stages: knowledge distillation from a 30B Thinking teacher to establish a structured reasoning backbone, then supervised fine-tuning on legal instruction data. 50x compression. Under 500MB quantized. Runs on a phone.

The training order is the thesis: teach the model how to reason first (distillation from Thinking teacher), then teach it what to reason about (legal SFT). The Thinking teacher’s extended deliberation traces transfer deeper reasoning structure than an Instruct teacher — critical when the student has only 0.6B parameters to work with.

“Structure beats scale, collaboration beats hierarchy, observation beats theory.” — Convergent Intelligence LLC: Research Division

Training Pipeline

Stage 1: Knowledge Distillation (STEM Reasoning Backbone)

Qwen3-0.6B distilled from Qwen3-30B-A3B-Thinking-2507 — a Mixture-of-Experts model with 30B total parameters, ~3B active per token, using the Thinking variant that generates extended internal reasoning traces.

Why the Thinking teacher matters at 0.6B: The Thinking variant produces higher-entropy softmax distributions than the Instruct variant — it considers more reasoning paths before committing. At distillation temperature T=2.0, the 0.6B student sees a richer landscape of alternative derivation strategies. With only 0.6B parameters, every bit of transferred structure counts. The Thinking teacher gives more.
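The effect of T=2.0 softening can be seen on a toy logit vector (the numbers are illustrative, not taken from the actual teacher): dividing logits by the temperature before the softmax flattens the distribution and raises its entropy, which is exactly the "richer landscape" the student distills from.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T flattens the distribution."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

logits = [4.0, 2.0, 1.0, 0.5]   # hypothetical teacher logits for one token
sharp = softmax(logits, T=1.0)  # what a hard T=1 target looks like
soft = softmax(logits, T=2.0)   # the distillation temperature used here

# The softened distribution puts more mass on alternative tokens,
# so its entropy is strictly higher.
assert entropy(soft) > entropy(sharp)
```

At T=1 the top token dominates; at T=2 the runner-up reasoning paths keep visible probability mass, so the student is graded on the teacher's full ranking rather than a single hard label.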

Data: 6,122 STEM chain-of-thought samples across 12 domains:

| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Advanced Calculus | 268 |
| Modern Physics | 177 |
| Physiology | 114 |
| Molecular Biology | 71 |

All sourced from 0xZee. Shuffled with seed 42 and split 95/5 into train/eval (5,815 train / 307 eval).

Loss function:

  1. Proof-Weighted Cross-Entropy (55%) — 2.5x weight on derivation tokens, decaying to 1.5x. Forces the student to allocate its limited capacity to reasoning steps, not answer formatting.
  2. Knowledge Distillation KL Divergence (45%) — T=2.0, scaled by T². Transfers the Thinking teacher’s full deliberation landscape.
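A minimal single-token sketch of this combined objective, with toy probability distributions. The 55/45 mix, T=2.0, T² scaling, and the 2.5→1.5 proof-weight decay follow the description above; the function names, the linear decay shape, and the collapse to one token are simplifying assumptions.

```python
import math

T = 2.0           # distillation temperature
CE_WEIGHT = 0.55  # proof-weighted cross-entropy share
KD_WEIGHT = 0.45  # KL-divergence share

def proof_weight(step, total_steps, start=2.5, end=1.5):
    """Decay the derivation-token weight from 2.5x to 1.5x (linear assumed)."""
    frac = step / max(total_steps - 1, 1)
    return start + (end - start) * frac

def kl_div(p_teacher, q_student):
    """KL(p || q) over one token's probability distribution."""
    return sum(p * math.log(p / q) for p, q in zip(p_teacher, q_student) if p > 0)

def combined_loss(student_probs, student_probs_T, teacher_probs_T, target_idx, w):
    """0.55 * proof-weighted CE + 0.45 * T^2-scaled KD, for a single token.

    student_probs:   student softmax at T=1 (CE term, against the hard label)
    student_probs_T: student softmax at T=2.0 (KD term)
    teacher_probs_T: teacher softmax at T=2.0 (KD term)
    w:               proof weight for this token (2.5 -> 1.5 on derivations)
    """
    ce = -w * math.log(student_probs[target_idx])
    kd = (T ** 2) * kl_div(teacher_probs_T, student_probs_T)
    return CE_WEIGHT * ce + KD_WEIGHT * kd
```

The T² factor compensates for the 1/T² shrinkage of soft-target gradients, keeping the two terms on comparable scales as temperature changes.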

Training format:

Solve the following problem carefully and show a rigorous derivation.

Problem:
{question}

Proof:
{CoT}

Final Answer:
{response}
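The Stage 1 format above can be assembled with a small helper (the function name is illustrative, not part of the released code):

```python
def build_stage1_prompt(question: str, cot: str, response: str) -> str:
    """Assemble one Stage 1 distillation sample in the training format."""
    return (
        "Solve the following problem carefully and show a rigorous derivation.\n\n"
        f"Problem:\n{question}\n\n"
        f"Proof:\n{cot}\n\n"
        f"Final Answer:\n{response}"
    )
```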

Stage 1 hyperparameters:

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 |
| Effective batch size | 8 |
| Learning rate | 1.5e-5 → 1e-6 (cosine) |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
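The cosine schedule in the table (1.5e-5 decaying to 1e-6) can be written out explicitly; the step counts below are illustrative, and no warmup is assumed:

```python
import math

LR_MAX, LR_MIN = 1.5e-5, 1e-6

def cosine_lr(step: int, total_steps: int) -> float:
    """Cosine decay from LR_MAX at step 0 to LR_MIN at the final step."""
    progress = step / (total_steps - 1)
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1 + math.cos(math.pi * progress))
```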

Stage 2: Supervised Fine-Tuning (Legal Domain)

The distilled model was fine-tuned on Alignment-Lab-AI/Lawyer-Instruct using TRL’s SFTTrainer.

Why legal on top of STEM: Legal reasoning is structurally isomorphic to mathematical reasoning — premise identification, logical chaining, exception handling, structured argumentation toward a conclusion. A model that learned rigorous derivation transfers that structure to legal analysis rather than learning legal templates from scratch.

Training format:

### Instruction:
{instruction}

### Response:
{output}
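For inference against this Stage 2 format, the prompt stops right after the response header so the model completes the answer (helper name is illustrative):

```python
def build_instruction_prompt(instruction: str, output: str = "") -> str:
    """Alpaca-style Stage 2 prompt; leave output empty at inference time."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
```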

Stage 2 hyperparameters:

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 5e-6 (lower than Stage 1 to preserve backbone) |
| Gradient checkpointing | Enabled |
| Precision | bf16 |

Model Details

| Attribute | Value |
|---|---|
| Architecture | Qwen3 (causal LM, RoPE, GQA) |
| Parameters | 0.6B |
| Base model | Qwen/Qwen3-0.6B |
| Teacher model | Qwen/Qwen3-30B-A3B-Thinking-2507 |
| Compression ratio | 50x |
| Stage 1 data | 6,122 STEM CoT samples (12 datasets) |
| Stage 2 data | Alignment-Lab-AI/Lawyer-Instruct |
| Context length | 1024 tokens (training) |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Legal instruction-following
prompt = """### Instruction:
What is the difference between a felony and a misdemeanor?

### Response:
"""

# STEM derivation (Stage 1 format still works)
prompt_stem = """Solve the following problem carefully and show a rigorous derivation.

Problem:
Compute the determinant of the matrix [[1, 2], [3, 4]].

Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

GGUF

Quantized versions at reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF.

Prompt Formats

STEM derivation (Stage 1):

Solve the following problem carefully and show a rigorous derivation.

Problem:
[Your problem]

Proof:

Instruction-following (Stage 2):

### Instruction:
[Your question]

### Response:

Intended Uses

Good for: Ultra-lightweight reasoning on mobile/edge/IoT, legal and STEM instruction-following, educational tutoring, embedded inference, component in multi-model pipelines, anywhere you need reasoning in under 500MB.

Not for: Formal proof verification, actual legal counsel, safety-critical analysis, complex multi-step proofs (>8 steps), or long-context tasks beyond 1024 tokens.

Limitations

0.6B is a hard capacity constraint. The model trades depth for deployability. It will make reasoning errors that a larger model would not. Multi-step derivations beyond ~8 steps degrade. Legal reasoning covers general concepts but lacks the nuance of larger models. Performance is weakest on underrepresented domains (molecular biology, physiology). Always verify outputs.

Related Models

| Model | Description |
|---|---|
| Qwen3-0.6B-STEM-Proof-Distilled-Thinking | Stage 1 only — pure STEM backbone |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF | This model quantized for edge deployment |
| Qwen3-1.7B-STEM-Proof-Distilled | Larger 1.7B variant (Instruct teacher) |
| Qwen3-1.7B-Distilled-30B-A3B-SFT | Larger 1.7B variant + legal SFT |

Citation

@misc{colca2026thinking06bsft,
  title={Two-Stage Reasoning Transfer at 0.6B: Thinking Teacher Distillation + Legal SFT},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT},
  note={Convergent Intelligence LLC: Research Division}
}

Convergent Intelligence LLC: Research Division “Where classical analysis fails to see, we begin.”