36 Downloads Updated 4 days ago
ollama run reaperdoesntrun/Qwen3-Distilled-Coder-1.7B
80acbca3fb0f · 1.3GB
A 1.7B model built in two stages: knowledge distillation from a 30B Coder teacher to establish a structured reasoning backbone, then supervised fine-tuning on ~54,600 logical inference problems. The Coder teacher’s decomposition patterns meet formal propositional logic.
The hypothesis: a model that learned STEM derivation from a Coder teacher (Stage 1) already has latent structure for sequential logic, state tracking, and compositional reasoning. Logical inference SFT (Stage 2) activates that structure explicitly — the model doesn’t learn logic from scratch, it surfaces what the Coder teacher already gave it.
“Structure beats scale, collaboration beats hierarchy, observation beats theory.” — Convergent Intelligence LLC: Research Division
Qwen3-1.7B distilled from Qwen3-Coder-30B-A3B-Instruct, the coding-specialized variant of the 30B MoE architecture. The STEM training data is the same as for the Instruct-teacher variants; only the teacher differs.
Why a Coder teacher? At distillation temperature T=2.0, the KL divergence term matches the student to the teacher’s softened output distribution, so the student sees how the teacher ranks every candidate token, not only its top choice. This transfers how the teacher organizes reasoning as well as its domain knowledge. The Coder variant organizes reasoning through precise sequential logic, explicit state tracking, and compositional decomposition, the same capabilities that make mathematical derivations rigorous and logical inference sound.
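To make the role of the temperature concrete, here is a minimal pure-Python sketch of a temperature-softened KL term. The logits are made-up values, and the T² scaling follows the common distillation convention; the card does not state whether it was used here, so treat it as an assumption.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL between softened teacher and student distributions, scaled by T^2.

    The T^2 factor is an assumption (a common convention), not from the card.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

# Made-up logits for a four-token vocabulary, for illustration only
teacher_logits = [4.0, 2.0, 1.0, 0.5]
student_logits = [3.0, 2.5, 0.5, 0.0]

sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=2.0)
loss = distillation_kl(teacher_logits, student_logits)
```

At T=2.0 the top token's probability in `soft` is lower than in `sharp`, so the lower-ranked tokens contribute more to the KL term than they would at T=1.0.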
Data: 6,122 STEM chain-of-thought samples across 12 domains from 0xZee:
| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Advanced Calculus | 268 |
| Modern Physics | 177 |
| Physiology | 114 |
| Molecular Biology | 71 |
Loss function:
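The formula itself is missing from this copy of the card. One form consistent with the hyperparameters listed for Stage 1 (a temperature and a proof weight) is a weighted cross-entropy plus a temperature-scaled KL term. The mixing coefficient \alpha, the T^2 factor, and the exact definition of the per-token weight w_t are assumptions, not values taken from the source:

```latex
\mathcal{L}
  = \alpha \,
    \frac{\sum_t w_t \, \mathrm{CE}\big(y_t,\; p_{\text{student}}(\cdot \mid x_{<t})\big)}
         {\sum_t w_t}
  + (1 - \alpha)\, T^2 \,
    \mathrm{KL}\big(p^{(T)}_{\text{teacher}} \,\big\|\, p^{(T)}_{\text{student}}\big),
\qquad
p^{(T)} = \mathrm{softmax}(z / T)
```

Under this reading, w_t would equal the proof weight for tokens in the Proof section of the training format and 1 elsewhere.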
Stage 1 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 |
| Effective batch size | 8 |
| Learning rate | 1.5e-5 → 1e-6 (cosine) |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
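The two scheduled values in the table can be sketched as functions of the optimizer step. This is a minimal sketch that assumes no warmup, a step count of ceil(5,815 / 8), and a linear decay for the proof weight; the card gives only the endpoints, so the decay shape is an assumption.

```python
import math

def cosine_lr(step, total_steps, lr_max=1.5e-5, lr_min=1e-6):
    """Cosine decay from lr_max at step 0 to lr_min at the final step."""
    progress = step / max(1, total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def proof_weight(step, total_steps, w_start=2.5, w_end=1.5):
    """Linear decay of the proof weight (the linear shape is an assumption)."""
    progress = step / max(1, total_steps)
    return w_start + (w_end - w_start) * progress

# 5,815 samples at effective batch size 8, keeping the final partial batch
total_steps = math.ceil(5815 / 8)

# (step, learning rate, proof weight) at the start, midpoint, and end
schedule = [
    (s, cosine_lr(s, total_steps), proof_weight(s, total_steps))
    for s in (0, total_steps // 2, total_steps)
]
```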
Training format:
```
Solve the following problem carefully and show a rigorous derivation.
Problem:
{question}
Proof:
{CoT}
Final Answer:
{response}
```
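The template above can be filled with a small helper. The sketch below also locates the proof region by character offset, on the assumption that the proof weight applies to the text between the `Proof:` and `Final Answer:` markers. The helper names, the lowercase `cot` field, and the exact newline layout are illustrative choices, not taken from the training code.

```python
STAGE1_TEMPLATE = (
    "Solve the following problem carefully and show a rigorous derivation.\n"
    "Problem:\n{question}\n"
    "Proof:\n{cot}\n"
    "Final Answer:\n{response}"
)

def format_stage1(question, cot, response):
    """Fill the Stage 1 template with one sample."""
    return STAGE1_TEMPLATE.format(question=question, cot=cot, response=response)

def proof_span(text):
    """Return (start, end) character offsets of the text between the markers."""
    start = text.index("Proof:\n") + len("Proof:\n")
    end = text.index("\nFinal Answer:", start)
    return start, end

# Made-up sample for illustration
sample = format_stage1(
    question="Differentiate f(x) = x^2.",
    cot="Apply the power rule: d/dx x^n = n x^(n-1), with n = 2.",
    response="f'(x) = 2x",
)
start, end = proof_span(sample)
proof_text = sample[start:end]
```

In a real pipeline the character offsets would still need to be mapped to token positions, for example through the tokenizer's offset mapping.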
The distilled model was fine-tuned on KonstantinDob/logic_inference_dataset — ~54,607 instruction-response pairs covering propositional logic, logical entailment, and formal inference.
About the dataset: Reproduced from the LogicInference paper (Santiago Ontañón, Google Research). It uses only the IID split, in the LOGICINFERENCE_e format, where the model performs the logical inference first and gives the final answer at the end. The 5,491 unique inference problems are expanded to ~54,607 instruction-response pairs. The dataset has three columns: INSTRUCTION, RESPONSE, SOURCE.
Why logical inference after Coder-distilled STEM? The Coder teacher gave the model structured decomposition patterns. The STEM data taught it to apply those patterns to derivations. Logical inference SFT takes the next step: formal propositional logic with explicit premises, inference rules, and conclusions. This is the most natural downstream task for a Coder-distilled reasoner — it’s making the implicit structure explicit.
Training format:
```
### Instruction:
{instruction}
### Response:
{response}
```
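A minimal sketch of mapping one dataset row onto this template, using the INSTRUCTION and RESPONSE column names given in the dataset description. The row contents, the `format_stage2` name, and the whitespace stripping are illustrative assumptions.

```python
STAGE2_TEMPLATE = "### Instruction:\n{instruction}\n### Response:\n{response}"

def format_stage2(row):
    """Map one dataset row (INSTRUCTION / RESPONSE columns) to training text."""
    return STAGE2_TEMPLATE.format(
        instruction=row["INSTRUCTION"].strip(),
        response=row["RESPONSE"].strip(),
    )

# Made-up row for illustration; real rows come from the dataset
row = {
    "INSTRUCTION": "Consider the following premises: p -> q. p. What can we infer?",
    "RESPONSE": "From p -> q and p we can infer q via modus ponens.",
    "SOURCE": "illustrative",
}
text = format_stage2(row)
```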
Stage 2 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 5e-6 (lower than Stage 1 to preserve backbone) |
| Gradient checkpointing | Enabled |
| Precision | bf16 |
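For a sense of scale, the effective batch size translates into optimizer steps per epoch as follows. This is a rough sketch that assumes every listed sample is used for training and the final partial batch is kept. The card does not say how the batch of 8 is split between per-device batch size and gradient accumulation, so the splits shown are hypothetical.

```python
import math

def optimizer_steps(num_samples, effective_batch_size):
    """Optimizer updates in one epoch, counting a final partial batch."""
    return math.ceil(num_samples / effective_batch_size)

def effective_batch(per_device_batch, grad_accum_steps, num_devices=1):
    """Effective batch size as the product of its three factors."""
    return per_device_batch * grad_accum_steps * num_devices

stage1_steps = optimizer_steps(5815, 8)    # Stage 1: 5,815 training samples
stage2_steps = optimizer_steps(54607, 8)   # Stage 2: ~54,607 pairs

# Hypothetical (per-device batch, accumulation steps) splits on one device
splits = [(8, 1), (4, 2), (2, 4), (1, 8)]
sizes = [effective_batch(b, g) for b, g in splits]
```

Under these assumptions Stage 1 runs for 727 updates and Stage 2 for 6,826.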
Model details:
| Attribute | Value |
|---|---|
| Architecture | Qwen3 (causal LM, RoPE, GQA) |
| Parameters | ~2B (1.7B advertised) |
| Base model | Qwen/Qwen3-1.7B |
| Teacher model | Qwen/Qwen3-Coder-30B-A3B-Instruct |
| Stage 1 data | 6,122 STEM CoT samples (12 datasets) |
| Stage 2 data | KonstantinDob/logic_inference_dataset (~54,607 pairs) |
| Context length | 1024 tokens (training) |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # bf16 on GPU, fp32 on CPU
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Logical inference (Stage 2 format)
prompt = """### Instruction:
Consider the following premises: For all x, if x is a cat then x is a mammal. Whiskers is a cat. What can we infer?
### Response:
"""

# STEM derivation (Stage 1 format still works)
prompt_stem = """Solve the following problem carefully and show a rigorous derivation.
Problem:
Prove that the composition of two injective functions is injective.
Proof:
"""

# Pass prompt_stem instead of prompt to try the Stage 1 format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding for reproducible output
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
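The decoded text from `generate` includes the prompt, and a small model may run on into a new instruction block. A minimal post-processing sketch follows. The `extract_response` helper and its stop marker are hypothetical, and the sketch assumes the tokenizer round-trips the prompt text exactly.

```python
def extract_response(decoded, prompt, stop="### Instruction:"):
    """Drop the echoed prompt and anything after a new instruction header."""
    body = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    return body.split(stop, 1)[0].strip()

# Made-up decoded output for illustration
example_prompt = "### Instruction:\nWhiskers is a cat. All cats are mammals. What can we infer?\n### Response:\n"
example_decoded = example_prompt + "Whiskers is a mammal.\n### Instruction:\nAnother question"
answer = extract_response(example_decoded, example_prompt)
```

If the decoded text does not start with the prompt, the helper falls back to splitting the whole string, which returns an empty result when the text itself begins with the stop marker.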
Quantized versions are available at reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF.
STEM derivation (Stage 1):
```
Solve the following problem carefully and show a rigorous derivation.
Problem:
[Your problem]
Proof:
```
Logical inference / instruction-following (Stage 2):
```
### Instruction:
[Your question or logical inference problem]
### Response:
```
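When serving the model through Ollama, the same Stage 2 prompt can be sent to the local REST endpoint. The sketch below assumes a default Ollama server on localhost:11434 and uses `raw` mode so the prompt is passed through without a chat template. Whether raw mode is the right choice for this particular tag is an assumption, and the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint
MODEL_TAG = "reaperdoesntrun/Qwen3-Distilled-Coder-1.7B"

def build_payload(question, max_tokens=512):
    """Build a non-streaming request that sends the Stage 2 prompt verbatim."""
    prompt = f"### Instruction:\n{question}\n### Response:\n"
    return {
        "model": MODEL_TAG,
        "prompt": prompt,
        "raw": True,       # send the prompt as-is, without a chat template
        "stream": False,   # return one JSON object instead of a stream
        "options": {"temperature": 0, "num_predict": max_tokens},
    }

def generate(question):
    """POST the payload to a running Ollama server and return the text."""
    data = json.dumps(build_payload(question)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Building the payload needs no server; generate() does
payload = build_payload("From p -> q and p, what can we infer?")
```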
Good for: Logical inference, propositional logic, formal reasoning, STEM derivation, structured argumentation, educational tutoring, component in verification pipelines, edge deployment via GGUF.
Not for: General code generation (the Coder teacher influence is structural, not functional — use a dedicated code model), formal proof verification (use Lean/Coq), safety-critical analysis, or tasks requiring long context beyond 1024 tokens.
This is a 1.7B-parameter model. It produces structured reasoning but can generate fluent, incorrect logic. The Coder teacher contributes structural decomposition, not code generation capability. Logical inference performance is strongest on propositional logic patterns represented in the training data; complex multi-step inferences with many quantifiers may exceed the model’s capacity. Always verify outputs independently.
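For purely propositional claims, one way to verify an inference is a brute-force truth-table check. The sketch below is not part of the model or its tooling. It assumes the premises and the conclusion have already been translated by hand into Python predicates, and it does not handle quantifiers.

```python
from itertools import product

def entails(premises, conclusion, variables):
    """True if every assignment satisfying all premises satisfies the conclusion."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a counterexample assignment
    return True

def implies(a, b):
    """Material implication: a -> b."""
    return (not a) or b

# Modus ponens: from p -> q and p, infer q (a valid inference)
premises = [lambda e: implies(e["p"], e["q"]), lambda e: e["p"]]
valid = entails(premises, lambda e: e["q"], ["p", "q"])

# Affirming the consequent: from p -> q and q, infer p (an invalid inference)
bad_premises = [lambda e: implies(e["p"], e["q"]), lambda e: e["q"]]
invalid = entails(bad_premises, lambda e: e["p"], ["p", "q"])
```

The check enumerates 2^n assignments, so it is only practical for a small number of propositional variables.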
| Model | Description |
|---|---|
| Qwen3-1.7B-Coder-Distilled | Stage 1 only — pure STEM backbone with Coder teacher |
| Qwen3-1.7B-Coder-Distilled-SFT-GGUF | This model quantized for edge deployment |
| Qwen3-1.7B-Distilled-30B-A3B-SFT | Instruct teacher + legal SFT variant |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT | 0.6B Thinking teacher + legal SFT |
Santiago Ontañón. “LogicInference: A Large-Scale Dataset for Logical Inference.” ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.
Convergent Intelligence LLC: Research Division “Where classical analysis fails to see, we begin.”