Coder teacher → STEM distillation → logical inference SFT → quantized. Structured reasoning in ~1.2GB.

Details

Updated 4 days ago

80acbca3fb0f · 1.3GB ·

model

arch	qwen3

parameters	2.03B

quantization	Q4_K_M

1.3GB

params

{ "repeat_penalty": 1, "stop": [ "<|im_start|>", "<|im_end|>" ], "te

120B

template

{{- $lastUserIdx := -1 -}} {{- range $idx, $msg := .Messages -}} {{- if eq $msg.Role "user" }}{{ $la

1.5kB

Qwen3-1.7B-Coder-Distilled-SFT

A 1.7B model built in two stages: knowledge distillation from a 30B Coder teacher to establish a structured reasoning backbone, then supervised fine-tuning on ~54,600 logical inference problems. The Coder teacher’s decomposition patterns meet formal propositional logic.

The hypothesis: a model that learned STEM derivation from a Coder teacher (Stage 1) already has latent structure for sequential logic, state tracking, and compositional reasoning. Under this view, logical inference SFT (Stage 2) activates that structure explicitly: the model does not learn logic from scratch; it surfaces what the Coder teacher already gave it.

“Structure beats scale, collaboration beats hierarchy, observation beats theory.” — Convergent Intelligence LLC: Research Division

Training Pipeline

Stage 1: Coder Teacher Knowledge Distillation (STEM Reasoning Backbone)

Qwen3-1.7B was distilled from Qwen3-Coder-30B-A3B-Instruct, the coding-specialized variant of the 30B MoE architecture. The STEM training data is the same as for the Instruct-teacher variants; only the teacher differs.

Why a Coder teacher? At distillation temperature T=2.0, the KL divergence term trains the student against the teacher's softened distribution over the whole vocabulary, not just its top-ranked token. The intent is to transfer not only domain knowledge but also how the teacher organizes reasoning. The Coder variant organizes reasoning through precise sequential logic, explicit state tracking, and compositional decomposition. These are the same capabilities that make mathematical derivations rigorous and logical inference sound.
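
To make the role of the temperature concrete, the following minimal sketch compares a softmax at T=1.0 and T=2.0. The logits are hypothetical values chosen for illustration, not taken from the teacher model, and the helper name is an assumption.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits over four candidate next tokens.
teacher_logits = [4.0, 2.0, 1.0, 0.0]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=2.0)

for name, probs in (("T=1.0", sharp), ("T=2.0", soft)):
    print(name, [round(p, 3) for p in probs])
```

At T=2.0 the top token keeps its rank but gives up probability mass to the alternatives, so the student receives a training signal about the lower-ranked tokens as well.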

Data: 6,122 STEM chain-of-thought samples across 12 domains from 0xZee:

Domain	Samples
Physics	2,254
Linear Algebra	667
Differential Equations	636
Electromagnetism	580
Mathematics	576
Engineering	574
Classical Mechanics	343
Theoretical Mechanics	307
Advanced Calculus	268
Modern Physics	177
Physiology	114
Molecular Biology	71

Loss function:

Proof-Weighted Cross-Entropy (55%): derivation tokens are weighted 2.5x, decaying to 1.5x
Knowledge Distillation KL Divergence (45%): T=2.0, scaled by T²
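
One way to write the combined objective is sketched below. The 55/45 split, T=2.0, and the T² scaling come from the description above; the normalization by the sum of weights, the teacher-to-student direction of the KL term, and the symbols w_t, z^teacher, and z^student are assumptions made for illustration.

```latex
\mathcal{L} = 0.55\,\mathcal{L}_{\mathrm{PCE}} + 0.45\,\mathcal{L}_{\mathrm{KD}}

\mathcal{L}_{\mathrm{PCE}} = -\frac{\sum_{t} w_t \log p_{\mathrm{student}}(y_t \mid y_{<t})}{\sum_{t} w_t},
\qquad
w_t = \begin{cases}
  w_{\mathrm{proof}} & \text{if token } t \text{ lies in the derivation} \\
  1 & \text{otherwise}
\end{cases}

\mathcal{L}_{\mathrm{KD}} = T^2 \sum_{t} \mathrm{KL}\!\left(
  \mathrm{softmax}\!\left(z^{\mathrm{teacher}}_t / T\right) \,\middle\|\,
  \mathrm{softmax}\!\left(z^{\mathrm{student}}_t / T\right)
\right),
\qquad T = 2.0
```

Here w_proof stands for the proof weight, which moves from 2.5 to 1.5. The T² factor is the usual correction that keeps the gradient magnitude of the softened KL term comparable to that of the cross-entropy term.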

Stage 1 hyperparameters:

Parameter	Value
Epochs	1
Training samples	5,815
Effective batch size	8
Learning rate	1.5e-5 → 1e-6 (cosine)
Temperature	2.0
Proof weight	2.5 → 1.5
Precision	bf16
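
The two decaying quantities in the table can be sketched as functions of the optimizer step. With 5,815 samples and an effective batch size of 8, one epoch is 727 steps if the final partial batch counts as a step. The absence of warmup and the linear shape of the proof-weight decay are assumptions; the source states only the endpoints and that the learning rate follows a cosine.

```python
import math

SAMPLES = 5_815
BATCH = 8
TOTAL_STEPS = math.ceil(SAMPLES / BATCH)  # 727 optimizer steps in one epoch

LR_START, LR_END = 1.5e-5, 1e-6
PROOF_START, PROOF_END = 2.5, 1.5

def progress(step):
    """Fraction of training completed, clamped to [0, 1]."""
    return min(max(step / (TOTAL_STEPS - 1), 0.0), 1.0)

def cosine_lr(step):
    """Cosine decay from LR_START to LR_END (no warmup assumed)."""
    cos = 0.5 * (1.0 + math.cos(math.pi * progress(step)))
    return LR_END + (LR_START - LR_END) * cos

def proof_weight(step):
    """Linear decay of the derivation-token weight (the shape is an assumption)."""
    return PROOF_START + (PROOF_END - PROOF_START) * progress(step)

for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS - 1):
    print(step, f"{cosine_lr(step):.2e}", round(proof_weight(step), 3))
```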

Training format:

Solve the following problem carefully and show a rigorous derivation.

Problem:
{question}

Proof:
{CoT}

Final Answer:
{response}

Stage 2: Logical Inference SFT

The distilled model was fine-tuned on KonstantinDob/logic_inference_dataset — ~54,607 instruction-response pairs covering propositional logic, logical entailment, and formal inference.

About the dataset: Reproduced from the LogicInference paper (Santiago Ontañón, Google Research). It uses only the IID split in the LOGICINFERENCE_e format, in which the model performs the logical inference first and gives the final answer at the end. 5,491 unique inference problems are expanded into ~54,607 instruction-response pairs. The dataset has three columns: INSTRUCTION, RESPONSE, SOURCE.

Why logical inference after Coder-distilled STEM? The Coder teacher gave the model structured decomposition patterns. The STEM data taught it to apply those patterns to derivations. Logical inference SFT takes the next step: formal propositional logic with explicit premises, inference rules, and conclusions. This is the most natural downstream task for a Coder-distilled reasoner — it’s making the implicit structure explicit.

Training format:

### Instruction:
{instruction}

### Response:
{response}
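
As a minimal sketch, a dataset row can be rendered into this format as follows. The column names INSTRUCTION and RESPONSE come from the dataset description; the function name, the whitespace stripping, and the example row (including its SOURCE value) are hypothetical and written for illustration only. Whether an end-of-sequence token was appended during training is not stated in the source.

```python
STAGE2_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def format_stage2(row):
    """Render one dataset row (INSTRUCTION/RESPONSE columns) as training text."""
    return STAGE2_TEMPLATE.format(
        instruction=row["INSTRUCTION"].strip(),
        response=row["RESPONSE"].strip(),
    )

# Hypothetical row, written for illustration; not taken from the dataset.
row = {
    "INSTRUCTION": "Consider the following premises: p -> q. p. What can we infer?",
    "RESPONSE": "From p -> q and p we can infer q via modus ponens.",
    "SOURCE": "hypothetical",
}

text = format_stage2(row)
print(text)
```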

Stage 2 hyperparameters:

Parameter	Value
Epochs	1
Effective batch size	8
Learning rate	5e-6 (lower than Stage 1 to preserve backbone)
Gradient checkpointing	Enabled
Precision	bf16

Model Details

Attribute	Value
Architecture	Qwen3 (causal LM, RoPE, GQA)
Parameters	~2B (1.7B advertised)
Base model	Qwen/Qwen3-1.7B
Teacher model	Qwen/Qwen3-Coder-30B-A3B-Instruct
Stage 1 data	6,122 STEM CoT samples (12 datasets)
Stage 2 data	KonstantinDob/logic_inference_dataset (~54,607 pairs)
Context length	1024 tokens (training)
License	Apache 2.0
Developer	Reaperdoesntrun / Convergent Intelligence LLC: Research Division

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Logical inference (Stage 2 format)
prompt = """### Instruction:
Consider the following premises: For all x, if x is a cat then x is a mammal. Whiskers is a cat. What can we infer?

### Response:
"""

# STEM derivation (Stage 1 format still works)
prompt_stem = """Solve the following problem carefully and show a rigorous derivation.

Problem:
Prove that the composition of two injective functions is injective.

Proof:
"""

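# Tokenize the Stage 2 prompt; pass prompt_stem instead to use the Stage 1 format.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding, capped at 512 new tokens.
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# The decoded text contains the prompt followed by the model's continuation.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))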

GGUF

Quantized versions are available at reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF.

Prompt Formats

STEM derivation (Stage 1):

Solve the following problem carefully and show a rigorous derivation.

Problem:
[Your problem]

Proof:

Logical inference / instruction-following (Stage 2):

### Instruction:
[Your question or logical inference problem]

### Response:
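
Because the decoded output contains the prompt followed by the continuation, a small helper can build the Stage 2 prompt and recover only the model's answer. This is a sketch: the function names are hypothetical, and truncating at a second "### Instruction:" marker is an assumption about how one might handle a model that keeps generating past its answer.

```python
def build_stage2_prompt(question):
    """Wrap a question in the Stage 2 instruction/response format."""
    return f"### Instruction:\n{question.strip()}\n\n### Response:\n"

def extract_completion(decoded, prompt):
    """Return only the generated continuation from a decoded sequence."""
    if decoded.startswith(prompt):
        completion = decoded[len(prompt):]
    else:
        # Fall back to everything after the last response marker.
        completion = decoded.rsplit("### Response:", 1)[-1]
    # Drop anything after a new instruction block, if the model starts one.
    return completion.split("### Instruction:", 1)[0].strip()

prompt = build_stage2_prompt("p -> q. p. What can we infer?")
# Hypothetical decoded text standing in for real model output.
decoded = prompt + "q, by modus ponens.\n\n### Instruction:\nNext problem"
answer = extract_completion(decoded, prompt)
print(answer)
```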

Intended Uses

Good for: Logical inference, propositional logic, formal reasoning, STEM derivation, structured argumentation, educational tutoring, component in verification pipelines, edge deployment via GGUF.

Not for: General code generation (the Coder teacher influence is structural, not functional — use a dedicated code model), formal proof verification (use Lean/Coq), safety-critical analysis, or tasks requiring long context beyond 1024 tokens.

Limitations

This is a 1.7B model. It produces structured reasoning but can generate fluent yet incorrect logic. The Coder teacher contributes structural decomposition, not code generation capability. Logical inference performance is strongest on propositional logic patterns represented in the training data. Complex multi-step inferences with many quantifiers may exceed the model's capacity. Always verify outputs.
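
For purely propositional claims, one simple way to verify an inference is a brute-force truth-table check. The sketch below is an illustration, not part of the model or its tooling: the helper names are hypothetical, the premises and conclusion must be translated into Python predicates by hand, and the approach is exponential in the number of variables and does not cover quantifiers.

```python
from itertools import product

def entails(premises, conclusion, variables):
    """True if every assignment satisfying all premises also satisfies the conclusion."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a counterexample assignment
    return True

def implies(a, b):
    """Material implication: a -> b."""
    return (not a) or b

# Modus ponens: from p -> q and p, infer q.
premises = [lambda e: implies(e["p"], e["q"]), lambda e: e["p"]]
valid = entails(premises, lambda e: e["q"], ["p", "q"])

# Affirming the consequent: from p -> q and q, inferring p is invalid.
fallacy = entails(
    [lambda e: implies(e["p"], e["q"]), lambda e: e["q"]],
    lambda e: e["p"],
    ["p", "q"],
)

print(valid, fallacy)  # prints: True False
```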

Related Models

Model	Description
Qwen3-1.7B-Coder-Distilled	Stage 1 only — pure STEM backbone with Coder teacher
Qwen3-1.7B-Coder-Distilled-SFT-GGUF	This model quantized for edge deployment
Qwen3-1.7B-Distilled-30B-A3B-SFT	Instruct teacher + legal SFT variant
Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT	0.6B Thinking teacher + legal SFT

References

Santiago Ontañón. “LogicInference: A Large-Scale Dataset for Logical Inference.” ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.

Convergent Intelligence LLC: Research Division “Where classical analysis fails to see, we begin.”