Gemma 3 models with tool support and a large context window (optimized for 24 GB VRAM, e.g. RTX 3090)

gemma3-tools

Gemma 3 IT models fine-tuned for reliable tool calling via <tool_call> XML tags. Based on Google’s Gemma 3 with QLoRA fine-tuning on NousResearch/hermes-function-calling-v1 (11,578 examples).

Available Tags

Tag        Size     Quant   Context  Best For
27b-ft     15.8 GB  Q4_K_M  40K      Agentic pipelines with explicit tool prompts
12b-ft     7.3 GB   Q4_K_M  65K      Best overall - highest accuracy across categories
12b-ft-v2  6.8 GB   Q4_K_M  32K      v2 with Claude Code training data (marginal improvement)
4b-ft      2.5 GB   Q4_K_M  32K      Lightweight - explicit prompts only, strong tool-call bias
latest     15.8 GB  Q4_K_M  40K      Alias for 27b-ft

Quick Start

ollama run orieg/gemma3-tools:12b-ft

API Usage (tool calling)

curl http://localhost:11434/api/chat -d '{
  "model": "orieg/gemma3-tools:12b-ft",
  "stream": false,
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get weather for a location",
    "parameters": {"type": "object", "properties": {"location": {"type": "string"}}, "required": ["location"]}
  }}]
}'

Returns structured tool calls in the <tool_call> XML format, which Ollama parses natively into the response's message.tool_calls array.
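Once message.tool_calls comes back, dispatching to local functions is straightforward. A minimal sketch - the get_weather implementation and the TOOLS registry are hypothetical; only the tool_calls shape ({"function": {"name", "arguments"}} with arguments already decoded) comes from the API response described above:

```python
# Hypothetical local implementation of the get_weather tool declared in the request.
def get_weather(location):
    return {"location": location, "forecast": "sunny"}

# Registry mapping tool names to local callables (illustrative).
TOOLS = {"get_weather": get_weather}

def dispatch(tool_calls):
    """Execute each entry of message.tool_calls as returned by /api/chat.

    Each entry carries a "function" object with "name" and an
    already-decoded "arguments" dict, so it can be splatted directly.
    """
    results = []
    for call in tool_calls:
        fn = TOOLS[call["function"]["name"]]
        results.append(fn(**call["function"]["arguments"]))
    return results
```

The results would then be sent back as role "tool" messages for the follow-up turn.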

Tool-Calling Reliability

Tested across four prompt categories, with toolsets of 8 and 22 tools, using the native Ollama API:

  • A - Explicit: prompt directly names the action (“Run the command: ls -la”, “Read the file X”)
  • B - Natural: prompt uses natural language (“What files are here?”, “How much disk space?”)
  • C - No-tool: prompt should be answered in text, no tool needed (“What is 2+2?”, “Explain recursion”)
  • D - Disambig: pick the right tool from several similar ones (grep vs list_dir vs bash)

Direct API results (8 and 22 tools combined)

Fine-tuned models compared against their base counterparts (same template, unmodified weights). C-no-tool results are from direct manual verification (automated batch testing was unreliable due to model-swapping between runs).

Model       A-explicit  B-natural  C-no-tool  D-disambig  Overall
4b (base)   50%         40%        0%         75%         41%
4b-ft       80%         20%        0%         50%         38%
12b (base)  100%        60%        ~75%       87%         ~68%
12b-ft      100%        50%        ~100%      100%        ~88%
12b-ft-v2   90%         50%        ~75%       100%        ~79%
27b (base)  100%        40%        ~100%      100%        ~85%
27b-ft      80%         50%        ~75%       50%         ~64%

What fine-tuning actually adds: Tool selection accuracy is roughly the same between base and fine-tuned models - the base 12b and 27b already perform well. The key benefit of fine-tuning is reliable <tool_call> XML format compliance: base models occasionally produce the correct tool intent in a JSON markdown block or other format that Ollama cannot parse into tool_calls. Fine-tuned models consistently use the expected XML tags.

C-no-tool note: The 4b models have a genuine tool-call bias regardless of prompt. The 12b and 27b models (base and fine-tuned) correctly answer conversational questions in plain text even when tools are available. No-tool accuracy is primarily a model size issue, not a fine-tuning artifact.

ollama-agent (5-8 tools, recommended client)

ollama-agent sidesteps most issues by using a tightly scoped system prompt and a small fixed toolset (shell, file read/write, grep, memory, RAG). In this setup the tool-call bias is a feature - the model is always expected to use tools.

Prompt                              27b-ft            12b-ft
“list all files in this folder”     Yes (ls)          Yes (ls)
“what files are here?”              Yes (ls)          Yes (ls)
“read the README.md file”           Yes (read_file)   Yes (read_file)
“search for TODO in the code”       Yes (grep)        Yes (grep)
“what is this project about?”       No (text)         -
“create a file called test.txt…”    Yes (write_file)  -
Accuracy                            83%               100% (33 tested)

When to Use These Models

Good fit:

  • Agentic pipelines where the model is always expected to use a tool (every turn calls a tool)
  • Controlled system prompts with 5-15 tools and clear descriptions
  • Explicit user prompts (the user states what to do, not just what they want)
  • ollama-agent or similar minimal-prompt frameworks

Poor fit:

  • Mixed chat + tool-calling with a 4b model (strong tool-call bias regardless of prompt)
  • Large system prompts with 22+ tools (accuracy degrades, format errors increase)
  • Agentic coding assistants like Claude Code (see below)

Claude Code - Not Recommended

Claude Code is not recommended with any of these models, even with large context:

  1. Tool-call bias: 0% accuracy on no-tool prompts means the model will emit spurious tool calls during normal conversation, breaking the agent loop
  2. System prompt size: Claude Code sends a ~20K token system prompt with 22+ tool definitions. All models show degraded accuracy at this scale
  3. Natural language prompts: Claude Code uses natural language to request actions (“list files in this folder”) rather than explicit tool references, which these models handle poorly (20-50% accuracy)
  4. Even with more context: Providing a larger context window does not fix the tool-call bias or natural language understanding issues - it only addresses prompt truncation, which is the least critical problem

Use a purpose-built model (e.g. a Claude model via the Anthropic API) for Claude Code tool calling.

Recommended Setup

For 24GB VRAM GPUs (RTX 3090/4090)

The 27b-ft model requires KV cache quantization to fit with usable context:

# Add to Ollama service environment:
OLLAMA_KV_CACHE_TYPE=q4_0
OLLAMA_FLASH_ATTENTION=1

On systemd-based systems:

sudo tee /etc/systemd/system/ollama.service.d/kv-cache.conf << EOF
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
Environment="OLLAMA_FLASH_ATTENTION=1"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama

VRAM Budget

Config                    Model    KV Cache  Total     Fits 24GB?
27b-ft, 40K ctx, q4_0 KV  15.8 GB  ~5.7 GB   ~22 GB    Yes
27b-ft, 28K ctx, q8_0 KV  15.8 GB  ~5.7 GB   ~21.5 GB  Yes
27b-ft, 28K ctx, FP16 KV  15.8 GB  ~14 GB    ~30 GB    No (OOM)
12b-ft, 65K ctx, q4_0 KV  7.3 GB   ~8.4 GB   ~16 GB    Yes
12b-ft, 65K ctx, q8_0 KV  7.3 GB   ~12 GB    ~19 GB    Yes
4b-ft, 32K ctx, q4_0 KV   2.5 GB   ~1.7 GB   ~4.2 GB   Yes
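The KV cache figures above can be sanity-checked with a back-of-the-envelope formula: 2 tensors (K and V) × layers × context length × KV heads × head dim × bytes per element. A rough sketch - the default dimensions are assumptions based on Gemma 3 27B's published architecture, and the q8_0/q4_0 byte counts are approximations:

```python
def kv_cache_gb(ctx, n_layers=62, n_kv_heads=16, head_dim=128, bytes_per_elem=2.0):
    """Rough KV cache size in GB.

    Per token, each layer stores one K and one V vector of
    n_kv_heads * head_dim elements. Defaults assume Gemma 3 27B;
    bytes_per_elem is ~2.0 for FP16, ~1.0 for q8_0, ~0.5 for q4_0.
    """
    elems = 2 * n_layers * ctx * n_kv_heads * head_dim
    return elems * bytes_per_elem / 1e9

# FP16 at 28K context lands near the ~14 GB in the table above;
# q4_0 cuts it roughly 4x.
fp16_28k = kv_cache_gb(28672)
q4_28k = kv_cache_gb(28672, bytes_per_elem=0.5)
```

This ignores runner overhead and the SWA-layer issue described under Known Limitations, so treat it as a lower bound.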

With ollama-agent

pip install ollama-agent
ollama-agent -m orieg/gemma3-tools:12b-ft

ollama-agent uses a lightweight system prompt with 5-8 tools (shell, file read/write, grep, memory, RAG). This is the sweet spot for these models - 12b-ft achieves 100% accuracy in this setup.

Known Limitations

Tool-Call Bias (4b models)

The 4b model has a strong bias toward emitting tool calls even for conversational prompts like “what is 2+2?”. This is a model size issue - the 12b and 27b variants (both base and fine-tuned) handle mixed chat + tool-calling correctly. If you need to use the 4b model in a context with conversational turns, add an explicit system prompt instruction:

Only call a tool when the user asks you to perform an action.
For questions and explanations, respond in plain text.
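Wired into an API call, that instruction goes in a system message. A sketch of the request payload - the payload shape follows the /api/chat example earlier; the helper itself is illustrative:

```python
import json

# The guard instruction from above, sent as a system message.
GUARD = (
    "Only call a tool when the user asks you to perform an action.\n"
    "For questions and explanations, respond in plain text."
)

def chat_payload(user_msg, tools):
    """Build a /api/chat request body that prepends the guard instruction."""
    return json.dumps({
        "model": "orieg/gemma3-tools:4b-ft",
        "stream": False,
        "messages": [
            {"role": "system", "content": GUARD},
            {"role": "user", "content": user_msg},
        ],
        "tools": tools,
    })
```

Note this only reduces, not eliminates, the 4b tool-call bias.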

Ollama SWA KV Cache Regression

Starting with Ollama 0.5.x, the Go runner allocates a full context-size KV cache for all layers - including Sliding Window Attention (SWA) layers that previously only needed ~1K tokens. For Gemma 3 27b at 28K context:

  • Old engine: model (15.8 GB) + SWA KV (~2.3 GB) + global KV (~4.7 GB) = ~22.8 GB (fits)
  • New engine: model (15.8 GB) + full KV (~14 GB) = ~29.8 GB (OOM)

Workaround: Use OLLAMA_KV_CACHE_TYPE=q4_0 (or q8_0) to quantize the KV cache. This reduces KV memory by 4x (or 2x), making large contexts viable again.

Context Truncation

Ollama silently truncates prompts that exceed num_ctx. With Claude Code’s ~35K token tool prompt and a model set to num_ctx=12288, you get:

truncating input prompt limit=12288 prompt=35946 keep=5 new=12288

The tool definitions are stripped, and the model never sees them. Always ensure num_ctx exceeds your expected prompt size.
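One way to guard against silent truncation is to size num_ctx from the prompt before sending. A crude sketch using a ~4 characters/token heuristic - not a real tokenizer, and the 4K rounding granularity is an arbitrary choice:

```python
def recommended_num_ctx(prompt_chars, reply_budget=2048, chars_per_token=4):
    """Estimate prompt tokens with a ~4 chars/token heuristic, add headroom
    for the model's reply, and round num_ctx up to the next 4K multiple."""
    est_tokens = prompt_chars // chars_per_token + reply_budget
    return ((est_tokens + 4095) // 4096) * 4096
```

For the ~35K-token Claude Code prompt mentioned above, this would suggest num_ctx around 40K rather than the 12288 that triggered the truncation log.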

Format Errors at Scale

At 15-22 tools, models occasionally fall back to emitting {"name": ..., "arguments": ...} inside a markdown code block instead of proper <tool_call> XML tags. Ollama does not parse these as tool_calls - they appear as plain content. Streaming helps slightly; non-streaming is more prone to this.
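A client can partially recover from this failure mode by scanning plain content for a fenced JSON tool call whenever message.tool_calls comes back empty. A hedged sketch - the fence pattern targets the failure described above, not any official format:

```python
import json
import re

# Matches a single JSON object inside a ``` or ```json fenced block.
_FENCED_JSON = re.compile(r"```(?:json)?\s*(\{.*?\})\s*```", re.DOTALL)

def recover_tool_call(content):
    """Try to recover a {"name", "arguments"} tool call that the model
    emitted as a markdown code block instead of <tool_call> XML tags.
    Returns the decoded dict, or None if nothing recoverable is found."""
    m = _FENCED_JSON.search(content)
    if not m:
        return None
    try:
        obj = json.loads(m.group(1))
    except json.JSONDecodeError:
        return None
    if {"name", "arguments"} <= obj.keys():
        return obj
    return None
```

This is a best-effort fallback; reducing the toolset size remains the more reliable fix.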

Fine-Tuning Details

  • Base model: unsloth/gemma-3-27b-it (27b) / unsloth/gemma-3-12b-it (12b) / unsloth/gemma-3-4b-it (4b)
  • Method: QLoRA (r=16, alpha=32, dropout=0.05)
  • Dataset: NousResearch/hermes-function-calling-v1 (11,578 examples, 5 configs)
  • Training: 1 epoch, batch_size=1, grad_accum=8, cosine LR schedule, lr=1e-4
  • Packing: Manual greedy bin-packing (Unsloth skips packing for multimodal architectures)
  • Hardware: NVIDIA RTX 3090 (24GB), ~10 hours per training run
  • v2 addition: 70 synthetic examples with captured Claude Code system prompt (12b-ft-v2 only)
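The manual packing step can be illustrated with a first-fit-decreasing sketch: sort examples by token length, then place each into the first bin with room. This is a generic illustration of greedy bin-packing, not the actual training script:

```python
def greedy_pack(lengths, max_len):
    """First-fit-decreasing bin-packing of example token lengths into
    bins of at most max_len tokens. Returns a list of bins, each with
    its used token count and the indices of the examples it holds."""
    bins = []
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        n = lengths[idx]
        for b in bins:
            if b["used"] + n <= max_len:
                b["used"] += n
                b["items"].append(idx)
                break
        else:
            bins.append({"used": n, "items": [idx]})
    return bins
```

Packing shorter examples together this way keeps each training sequence close to full, which matters when the framework skips packing automatically.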

Template

Uses Gemma 3 chat format with tool-call support:

<start_of_turn>user
{system prompt + tools}
{user message}<end_of_turn>
<start_of_turn>model
<tool_call>
{"name": "function_name", "arguments": {"key": "value"}}
</tool_call><end_of_turn>

Tool responses are passed back as:

<start_of_turn>user
<tool_response>
{tool output}
</tool_response><end_of_turn>
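Ollama applies this template automatically from the Modelfile; assembling it by hand only matters for raw /api/generate calls. A minimal sketch of the two turn shapes shown above (the helper names are illustrative):

```python
def render_user_turn(system, user):
    """First turn: the system prompt (plus tool definitions) and the user
    message share one user turn, then the model turn is opened."""
    return (
        "<start_of_turn>user\n"
        f"{system}\n{user}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

def render_tool_response(output):
    """Follow-up turn: a tool result goes back inside <tool_response> tags."""
    return (
        "<start_of_turn>user\n"
        f"<tool_response>\n{output}\n</tool_response><end_of_turn>\n"
        "<start_of_turn>model\n"
    )
```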

License

Based on Google’s Gemma 3 models. Subject to the Gemma Terms of Use.