
A Quantized, Fine-Tuned Model for Enhanced Tool Calling, Code Generation, and Reasoning

Capabilities: tools, thinking
ollama run brnpistone/Qwen3-4B-AgentCoder-q6-k

Applications

  • Claude Code: ollama launch claude --model brnpistone/Qwen3-4B-AgentCoder-q6-k
  • Codex: ollama launch codex --model brnpistone/Qwen3-4B-AgentCoder-q6-k
  • OpenCode: ollama launch opencode --model brnpistone/Qwen3-4B-AgentCoder-q6-k
  • OpenClaw: ollama launch openclaw --model brnpistone/Qwen3-4B-AgentCoder-q6-k


🧠 Qwen3-4B-AgentCoder-GGUF

A Quantized, Fine-Tuned Model for Enhanced Tool Calling, Code Generation, and Reasoning


Model Description

Qwen3-4B-AgentCoder-GGUF is a quantized version of the Qwen3-4B-AgentCoder model, converted with llama.cpp. This model is optimized for:

  • 🧮 Complex reasoning tasks
  • 🧰 Tool calling
  • 💻 Code generation

The model was developed through sequential fine-tuning, followed by a Direct Preference Optimization (DPO) post-training stage to improve alignment, coherence, and reasoning accuracy.

Highlights

  • Fine-tuned on three specialized datasets
  • Retains thinking-mode behavior with long-context reasoning (~264K tokens)
  • Post-trained with DPO using chosen/rejected pairs for better alignment
  • Excellent balance between tool use, code generation, and reasoning

🚀 Direct Use

Qwen3-4B-AgentCoder-GGUF can be used directly for:

  • ✅ Tool calling in complex reasoning tasks
  • ✅ Code generation for Python, JavaScript, and other languages
  • ✅ Multi-domain reasoning (math, logic, Q&A)
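As a minimal sketch of tool calling with this model, the request body below follows the shape of Ollama's /api/chat endpoint with a tool attached. The get_weather tool, its parameters, and the prompt are hypothetical examples; the body is only constructed here, not sent.

```python
import json

def build_tool_call_request(prompt: str) -> dict:
    """Build a chat request body with one tool definition attached."""
    return {
        "model": "brnpistone/Qwen3-4B-AgentCoder-q6-k",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical example tool
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "stream": False,
    }

payload = build_tool_call_request("What is the weather in Rome?")
print(json.dumps(payload, indent=2))
```

When the model decides to use the tool, the response message carries a tool_calls entry with the function name and arguments, which the caller executes and feeds back as a tool-role message.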

โš ๏ธ Out-of-Scope Use

  • โŒ Highly sensitive or confidential data
  • โŒ Domains requiring expert-level specialization
  • โŒ Tasks where full explainability is mandatory

🧠 Training Details

Training Procedure

Phase 1: Supervised Fine-Tuning

  • Learning rate: 1e-5
  • Batch size: 4
  • Gradient accumulation: 4
  • Epochs: 3
  • Warmup steps: 100
  • Weight decay: 0.01
  • Sequence length: ~2.4K tokens

Training Data

  • interstellarninja/hermes_reasoning_tool_use (~51K samples): multi-turn tool use

Phase 2: Sequential Fine-Tuning (Supervised)

  • Learning rate: 3e-5
  • Batch size: 1
  • Gradient accumulation: 8
  • Epochs: 2
  • Warmup steps: 100
  • Weight decay: 0.05
  • Sequence length: ~13K tokens

Training Data

  • ise-uiuc/Magicoder-OSS-Instruct-75K (~38K samples): code generation
  • open-thoughts/OpenThoughts-114k (~37K samples): general reasoning
  • interstellarninja/hermes_reasoning_tool_use (~30K samples): tool use
  • custom/dpo-toolcode-alignment (~15K samples): DPO preference pairs
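The listed batch sizes and gradient-accumulation steps determine the effective batch size per optimizer step. A small sketch computing it for each phase (assuming a single GPU, as the compute section describes):

```python
def effective_batch_size(per_device_batch: int, grad_accum: int) -> int:
    """Samples contributing to one optimizer step on a single device."""
    return per_device_batch * grad_accum

# Values taken from the hyperparameter lists above.
phases = {
    "sft_phase1": effective_batch_size(4, 4),  # batch 4, accumulation 4
    "sft_phase2": effective_batch_size(1, 8),  # batch 1, accumulation 8
    "dpo":        effective_batch_size(2, 8),  # batch 2, accumulation 8
}
print(phases)
```

Note that Phase 2 halves the effective batch relative to Phase 1 while raising the learning rate and sequence length, trading throughput for long-context samples.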

Phase 3: Post-Training with Direct Preference Optimization (DPO)

After sequential fine-tuning, the model underwent a DPO phase to enhance response alignment, reasoning robustness, and factual consistency.

  • Learning rate: 1e-6
  • Batch size: 2
  • Gradient accumulation: 8
  • Epochs: 5
  • Beta: 0.2
  • Loss type: sigmoid
  • Warmup steps: 2
  • Sequence length: ~1.5K tokens

DPO Data

  • ~1.5K chosen/rejected response pairs
  • Rejected samples synthetically generated to represent poor or incoherent answers
  • Chosen samples verified or automatically ranked for quality and correctness
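One chosen/rejected pair in the prompt/chosen/rejected record format that TRL's DPOTrainer consumes is sketched below. The texts are invented illustrations, not samples from the actual dataset.

```python
# Hypothetical preference record: the chosen answer is a correct, coherent
# solution; the rejected one is the kind of poor answer described above.
pair = {
    "prompt": "Write a Python function that returns the factorial of n.",
    "chosen": (
        "def factorial(n):\n"
        "    if n < 0:\n"
        "        raise ValueError('n must be non-negative')\n"
        "    result = 1\n"
        "    for i in range(2, n + 1):\n"
        "        result *= i\n"
        "    return result"
    ),
    "rejected": "factorial is n * n",  # incoherent, not even a function
}
print(sorted(pair))
```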

Objectives

  • Encourage the model to prefer chosen completions over rejected ones
  • Improve clarity, correctness, and helpfulness
  • Reduce hallucinations and verbosity
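The sigmoid loss listed above corresponds to the standard DPO objective, -log σ(β · margin), where the margin compares how much more the policy prefers the chosen completion than the reference model does. A minimal sketch with the listed β = 0.2, using illustrative log-probabilities rather than real model outputs:

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.2):
    """Standard DPO loss: -log(sigmoid(beta * reward margin))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # Loss is small when the policy favors the chosen completion.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy prefers the chosen completion -> low loss.
low = dpo_sigmoid_loss(-10.0, -14.0, -12.0, -12.0)
# Policy prefers the rejected completion -> high loss.
high = dpo_sigmoid_loss(-14.0, -10.0, -12.0, -12.0)
print(low, high)
```

Minimizing this loss pushes the policy's implicit reward for chosen completions above that for rejected ones, which is what the objectives above describe.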


📊 Evaluation

The model was evaluated on multiple benchmarks to assess its capabilities across different domains:

Benchmark                  Score    Details
HumanEval                  72.0%    Base tests
HumanEval+                 68.5%    Base + extra tests
GSM8K                      82.0%    1082/1319 correct
MMLU                       77.7%    1190/1531 correct (validation split)
Multi-turn Tool Calling    70.0%    70/100 correct
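The percentages for the count-based rows can be cross-checked against the raw correct/total figures in the table:

```python
# (correct, total, reported %) taken directly from the table above.
scores = {
    "GSM8K": (1082, 1319, 82.0),
    "MMLU": (1190, 1531, 77.7),
    "Multi-turn Tool Calling": (70, 100, 70.0),
}

for name, (correct, total, reported) in scores.items():
    pct = round(100 * correct / total, 1)
    print(f"{name}: {pct}% (reported {reported}%)")
```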

Evaluation Datasets

  • HumanEval/HumanEval+: openai/openai_humaneval - 164 hand-written programming problems
  • GSM8K: openai/gsm8k (test split) - 1,319 grade school math word problems
  • MMLU: cais/mmlu (validation split) - 1,531 multiple-choice questions across 57 subjects
  • Tool Calling: Custom dataset - 100 tool calling scenarios
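HumanEval results are conventionally reported with the unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples per problem with c passing; the model card does not state the sampling setup, so this is an illustrative sketch rather than the exact scoring script used. With n = k = 1 it reduces to plain per-problem pass/fail accuracy.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k drawn
    samples (out of n generated, c of which pass) is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# One sample per problem: pass@1 is simply pass or fail.
print(pass_at_k(1, 1, 1), pass_at_k(1, 0, 1))
# With 5 samples and 2 passing, pass@1 estimates the per-sample pass rate.
print(pass_at_k(5, 2, 1))
```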

Evaluation Factors

  • Tool calling accuracy
  • Code generation quality
  • General reasoning performance
  • Alignment and factual consistency (post-DPO)

Observations

  • DPO improved reasoning precision and response coherence
  • Code generation accuracy increased in structured programming tasks
  • Reduced non-determinism in multi-step tool use

๐Ÿ–ฅ๏ธ Technical Specifications

Model Architecture

  • Model type: Causal language model
  • Parameters: 4.0B
  • Context length: ~264K tokens
  • Thinking mode: Enabled

Compute Infrastructure

Hardware

  • GPU: NVIDIA H100 (80 GB VRAM)
  • System RAM: 2 TiB
  • Memory per vCPU: 10.67 GiB

Software

  • Python: 3.12
  • Transformers: 4.55.0
  • Libraries: bitsandbytes, safetensors, torch, trl, scikit-learn, tokenizers, psutil, py7zr


🧭 Recommendations

  • Tool-use accuracy depends on task complexity
  • Code generation may occasionally produce minor syntax issues
  • Reasoning is strongest in structured, logical, and mathematical contexts
  • Avoid using this model for confidential or safety-critical applications

🧠 Qwen3-4B-AgentCoder-GGUF, created by Bruno Pistone
Enhanced reasoning, tool calling, and code generation, refined with DPO alignment