ollama run orieg/gemma3-tools:12b-ft
Gemma 3 IT models fine-tuned for reliable tool calling via <tool_call> XML tags. Based on Google’s Gemma 3 with QLoRA fine-tuning on NousResearch/hermes-function-calling-v1 (11,578 examples).
| Tag | Size | Quant | Context | Best For |
|---|---|---|---|---|
| 27b-ft | 15.8 GB | Q4_K_M | 40K | Agentic pipelines with explicit tool prompts |
| 12b-ft | 7.3 GB | Q4_K_M | 65K | Best overall - highest accuracy across categories |
| 12b-ft-v2 | 6.8 GB | Q4_K_M | 32K | v2 with Claude Code training data (marginal improvement) |
| 4b-ft | 2.5 GB | Q4_K_M | 32K | Lightweight - explicit prompts only, strong tool-call bias |
| latest | 15.8 GB | Q4_K_M | 40K | Alias for 27b-ft |
ollama run orieg/gemma3-tools:12b-ft
curl http://localhost:11434/api/chat -d '{
"model": "orieg/gemma3-tools:12b-ft",
"stream": false,
"messages": [{"role": "user", "content": "What is the weather in Paris?"}],
"tools": [{"type": "function", "function": {
"name": "get_weather",
"description": "Get weather for a location",
"parameters": {"type": "object", "properties": {"location": {"type": "string"}}, "required": ["location"]}
}}]
}'
The model returns tool calls in <tool_call> XML format, which Ollama parses natively into the response's message.tool_calls array.
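Ollama handles this parsing for you, but if you are inspecting raw model output yourself (for example while debugging format issues), a minimal extraction sketch may help. `extract_tool_calls` is a hypothetical helper, not part of the Ollama API:

```python
import json
import re

def extract_tool_calls(text):
    """Hypothetical helper: pull <tool_call> JSON payloads out of raw output.
    Ollama normally does this itself and exposes message.tool_calls."""
    calls = []
    for payload in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            pass  # malformed JSON is left as plain content
    return calls

raw = '<tool_call>\n{"name": "get_weather", "arguments": {"location": "Paris"}}\n</tool_call>'
print(extract_tool_calls(raw))
```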
Tested across four categories with 8 and 22 tools using the native Ollama API:
Fine-tuned models compared against their base counterparts (same template, unmodified weights). C-no-tool results are from direct manual verification (automated batch testing was unreliable due to model-swapping between runs).
| Model | A-explicit | B-natural | C-no-tool | D-disambig | Overall |
|---|---|---|---|---|---|
| 4b (base) | 50% | 40% | 0% | 75% | 41% |
| 4b-ft | 80% | 20% | 0% | 50% | 38% |
| 12b (base) | 100% | 60% | ~75% | 87% | ~68% |
| 12b-ft | 100% | 50% | ~100% | 100% | ~88% |
| 12b-ft-v2 | 90% | 50% | ~75% | 100% | ~79% |
| 27b (base) | 100% | 40% | ~100% | 100% | ~85% |
| 27b-ft | 80% | 50% | ~75% | 50% | ~64% |
What fine-tuning actually adds: Tool selection accuracy is roughly the same between base and fine-tuned models - the base 12b and 27b already perform well. The key benefit of fine-tuning is reliable <tool_call> XML format compliance: base models occasionally produce the correct tool intent in a JSON markdown block or other format that Ollama cannot parse into tool_calls. Fine-tuned models consistently use the expected XML tags.
C-no-tool note: The 4b models have a genuine tool-call bias regardless of prompt. The 12b and 27b models (base and fine-tuned) correctly answer conversational questions in plain text even when tools are available. No-tool accuracy is primarily a model size issue, not a fine-tuning artifact.
ollama-agent sidesteps most issues by using a tightly scoped system prompt and a small fixed toolset (shell, file read/write, grep, memory, RAG). In this setup the tool-call bias is a feature - the model is always expected to use tools.
| Prompt | 27b-ft | 12b-ft |
|---|---|---|
| “list all files in this folder” | Yes (ls) | Yes (ls) |
| “what files are here?” | Yes (ls) | Yes (ls) |
| “read the README.md file” | Yes (read_file) | Yes (read_file) |
| “search for TODO in the code” | Yes (grep) | Yes (grep) |
| “what is this project about?” | No (text) | - |
| “create a file called test.txt…” | Yes (write_file) | - |
| Accuracy | 83% | 100% (3⁄3 tested) |
Good fit:
- Agentic pipelines where the model is always expected to use a tool (every turn calls a tool)
- Controlled system prompts with 5-15 tools and clear descriptions
- Explicit user prompts (the user states what to do, not just what they want)
- ollama-agent or similar minimal-prompt frameworks

Poor fit:
- Mixed chat + tool-calling with a 4b model (strong tool-call bias regardless of prompt)
- Large system prompts with 22+ tools (accuracy degrades, format errors increase)
- Agentic coding assistants like Claude Code (see below)
Claude Code is not recommended with any of these models, even with large context.
Use a purpose-built model (e.g. a Claude model via the Anthropic API) for Claude Code tool calling.
The 27b-ft model requires KV cache quantization to fit with usable context:
# Add to Ollama service environment:
OLLAMA_KV_CACHE_TYPE=q4_0
OLLAMA_FLASH_ATTENTION=1
On systemd-based systems:
sudo tee /etc/systemd/system/ollama.service.d/kv-cache.conf << EOF
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
Environment="OLLAMA_FLASH_ATTENTION=1"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
| Config | Model | KV Cache | Total | Fits 24GB? |
|---|---|---|---|---|
| 27b-ft, 40K ctx, q4_0 KV | 15.8 GB | ~5.7 GB | ~22 GB | Yes |
| 27b-ft, 28K ctx, q8_0 KV | 15.8 GB | ~5.7 GB | ~21.5 GB | Yes |
| 27b-ft, 28K ctx, FP16 KV | 15.8 GB | ~14 GB | ~30 GB | No (OOM) |
| 12b-ft, 65K ctx, q4_0 KV | 7.3 GB | ~8.4 GB | ~16 GB | Yes |
| 12b-ft, 65K ctx, q8_0 KV | 7.3 GB | ~12 GB | ~19 GB | Yes |
| 4b-ft, 32K ctx, q4_0 KV | 2.5 GB | ~1.7 GB | ~4.2 GB | Yes |
pip install ollama-agent
ollama-agent -m orieg/gemma3-tools:12b-ft
ollama-agent uses a lightweight system prompt with 5-8 tools (shell, file read/write, grep, memory, RAG). This is the sweet spot for these models - 12b-ft achieves 100% accuracy in this setup.
The 4b model has a strong bias toward emitting tool calls even for conversational prompts like “what is 2+2?”. This is a model size issue - the 12b and 27b variants (both base and fine-tuned) handle mixed chat + tool-calling correctly. If you need to use the 4b model in a context with conversational turns, add an explicit system prompt instruction:
Only call a tool when the user asks you to perform an action.
For questions and explanations, respond in plain text.
Starting with Ollama 0.5.x, the Go runner allocates a full context-size KV cache for all layers, including Sliding Window Attention (SWA) layers that previously only needed ~1K tokens. For Gemma 3 27b at 28K context, the FP16 KV cache alone is roughly 14 GB (see the VRAM table above).
Workaround: Use OLLAMA_KV_CACHE_TYPE=q4_0 (or q8_0) to quantize the KV cache. This reduces KV memory by 4x (or 2x), making large contexts viable again.
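The 4x/2x reductions fall directly out of bytes per element (FP16 = 2 bytes, q8_0 ≈ 1 byte, q4_0 ≈ 0.5 bytes). A back-of-envelope estimator; the dimensions below are illustrative placeholders, not Gemma 3's real configuration:

```python
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each hold ctx * n_kv_heads * head_dim elements per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Placeholder dimensions for illustration only (not Gemma 3's actual shapes).
args = (28_000, 48, 8, 128)
fp16 = kv_cache_bytes(*args, bytes_per_elem=2.0)
q8 = kv_cache_bytes(*args, bytes_per_elem=1.0)
q4 = kv_cache_bytes(*args, bytes_per_elem=0.5)
print(fp16 / q8, fp16 / q4)  # 2.0 4.0
```

Real quantized formats carry a small per-block scale overhead, so the true savings are slightly below the ideal 2x/4x.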
Ollama silently truncates prompts that exceed num_ctx. With Claude Code’s ~35K token tool prompt and a model set to num_ctx=12288, you get:
truncating input prompt limit=12288 prompt=35946 keep=5 new=12288
The tool definitions are stripped, and the model never sees them. Always ensure num_ctx exceeds your expected prompt size.
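A cheap guard is to estimate prompt tokens before sending. The chars-per-token ratio below is a rough heuristic, not the model's actual tokenizer:

```python
def fits_context(prompt, num_ctx, chars_per_token=4.0):
    """Rough pre-flight check against silent truncation (heuristic only)."""
    est_tokens = int(len(prompt) / chars_per_token)
    if est_tokens > num_ctx:
        print(f"warning: ~{est_tokens} estimated tokens exceed num_ctx={num_ctx}")
        return False
    return True

fits_context("x" * 200_000, 12288)  # a Claude-Code-sized prompt vs a 12K window
```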
At 15-22 tools, models occasionally fall back to emitting {"name": ..., "arguments": ...} inside a markdown code block instead of proper <tool_call> XML tags. Ollama does not parse these as tool_calls - they appear as plain content. Streaming helps slightly; non-streaming is more prone to this.
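If you want to salvage those fallbacks rather than drop them, a best-effort recovery sketch can work; `salvage_markdown_tool_call` is hypothetical, not an Ollama feature:

```python
import json
import re

FENCE = "`" * 3  # avoids writing a literal markdown fence inside this example

def salvage_markdown_tool_call(content):
    """Hypothetical best-effort recovery of a {"name": ..., "arguments": ...}
    payload emitted inside a markdown code fence instead of <tool_call> tags."""
    m = re.search(r"`{3}(?:json)?\s*(\{.*\})\s*`{3}", content, re.DOTALL)
    if not m:
        return None
    try:
        obj = json.loads(m.group(1))
    except json.JSONDecodeError:
        return None
    return obj if {"name", "arguments"} <= obj.keys() else None

fallback = FENCE + 'json\n{"name": "grep", "arguments": {"pattern": "TODO"}}\n' + FENCE
print(salvage_markdown_tool_call(fallback))
```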
Base models: unsloth/gemma-3-27b-it (27b) / unsloth/gemma-3-12b-it (12b) / unsloth/gemma-3-4b-it (4b). The models use the Gemma 3 chat format with tool-call support:
<start_of_turn>user
{system prompt + tools}
{user message}<end_of_turn>
<start_of_turn>model
<tool_call>
{"name": "function_name", "arguments": {"key": "value"}}
</tool_call><end_of_turn>
Tool responses are passed back as:
<start_of_turn>user
<tool_response>
{tool output}
</tool_response><end_of_turn>
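The turn structure above can be sketched as a string template. This is illustrative only; Ollama's bundled Go template is the source of truth:

```python
import json

def render_tool_turn(system_and_tools, user_msg, tool_call):
    # Mirrors the turn layout shown above; not Ollama's actual Go template.
    return (
        "<start_of_turn>user\n"
        f"{system_and_tools}\n"
        f"{user_msg}<end_of_turn>\n"
        "<start_of_turn>model\n"
        "<tool_call>\n"
        f"{json.dumps(tool_call)}\n"
        "</tool_call><end_of_turn>"
    )

turn = render_tool_turn(
    "You may call get_weather.",
    "What is the weather in Paris?",
    {"name": "get_weather", "arguments": {"location": "Paris"}},
)
print(turn)
```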
Based on Google’s Gemma 3 models. Subject to the Gemma Terms of Use.