ollama run orieg/gemma3-tools:12b-ft
Gemma 3 IT models fine-tuned for reliable tool calling via <tool_call> XML tags. Based on Google’s Gemma 3 with QLoRA fine-tuning on NousResearch/hermes-function-calling-v1 (11,578 examples).
| Tag | Size | Quant | Context | Best For |
|---|---|---|---|---|
| 27b-ft | 15.8 GB | Q4_K_M | 40K | Agentic pipelines with explicit tool prompts |
| 12b-ft | 7.3 GB | Q4_K_M | 65K | Best overall - highest accuracy across categories |
| 12b-ft-v2 | 6.8 GB | Q4_K_M | 32K | v2 with Claude Code training data (marginal improvement) |
| 4b-ft | 2.5 GB | Q4_K_M | 32K | Lightweight - explicit prompts only, strong tool-call bias |
| latest | 15.8 GB | Q4_K_M | 40K | Alias for 27b-ft |
ollama run orieg/gemma3-tools:12b-ft
curl http://localhost:11434/api/chat -d '{
"model": "orieg/gemma3-tools:12b-ft",
"stream": false,
"messages": [{"role": "user", "content": "What is the weather in Paris?"}],
"tools": [{"type": "function", "function": {
"name": "get_weather",
"description": "Get weather for a location",
"parameters": {"type": "object", "properties": {"location": {"type": "string"}}, "required": ["location"]}
}}]
}'
The model returns tool calls in <tool_call> XML format, which Ollama parses natively into the response's message.tool_calls array.
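Ollama handles this parsing for you, but if you are inspecting raw model output yourself (for example while debugging format issues), a minimal extraction sketch may help. `extract_tool_calls` is a hypothetical helper, not part of the Ollama API:

```python
import json
import re

def extract_tool_calls(text):
    """Hypothetical helper: pull <tool_call> JSON payloads out of raw output.
    Ollama normally does this itself and exposes message.tool_calls."""
    calls = []
    for payload in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            pass  # malformed JSON is left as plain content
    return calls

raw = '<tool_call>\n{"name": "get_weather", "arguments": {"location": "Paris"}}\n</tool_call>'
print(extract_tool_calls(raw))
```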
Tested across four categories with 8 and 22 tools using the native Ollama API:
Fine-tuned models compared against their base counterparts (same template, unmodified weights). C-no-tool results are from direct manual verification (automated batch testing was unreliable due to model-swapping between runs).
| Model | A-explicit | B-natural | C-no-tool | D-disambig | Overall |
|---|---|---|---|---|---|
| 4b (base) | 50% | 40% | 0% | 75% | 41% |
| 4b-ft | 80% | 20% | 0% | 50% | 38% |
| 12b (base) | 100% | 60% | ~75% | 87% | ~68% |
| 12b-ft | 100% | 50% | ~100% | 100% | ~88% |
| 12b-ft-v2 | 90% | 50% | ~75% | 100% | ~79% |
| 27b (base) | 100% | 40% | ~100% | 100% | ~85% |
| 27b-ft | 80% | 50% | ~75% | 50% | ~64% |
What fine-tuning actually adds: Tool selection accuracy is roughly the same between base and fine-tuned models - the base 12b and 27b already perform well. The key benefit of fine-tuning is reliable <tool_call> XML format compliance: base models occasionally produce the correct tool intent in a JSON markdown block or other format that Ollama cannot parse into tool_calls. Fine-tuned models consistently use the expected XML tags.
C-no-tool note: The 4b models have a genuine tool-call bias regardless of prompt. The 12b and 27b models (base and fine-tuned) correctly answer conversational questions in plain text even when tools are available. No-tool accuracy is primarily a model size issue, not a fine-tuning artifact.
ollama-agent sidesteps most issues by using a tightly scoped system prompt and a small fixed toolset (shell, file read/write, grep, memory, RAG). In this setup the tool-call bias is a feature - the model is always expected to use tools.
| Prompt | 27b-ft | 12b-ft |
|---|---|---|
| “list all files in this folder” | Yes (ls) | Yes (ls) |
| “what files are here?” | Yes (ls) | Yes (ls) |
| “read the README.md file” | Yes (read_file) | Yes (read_file) |
| “search for TODO in the code” | Yes (grep) | Yes (grep) |
| “what is this project about?” | No (text) | - |
| “create a file called test.txt…” | Yes (write_file) | - |
| Accuracy | 83% | 100% (3⁄3 tested) |
Good fit:
- Agentic pipelines where the model is always expected to use a tool (every turn calls a tool)
- Controlled system prompts with 5-15 tools and clear descriptions
- Explicit user prompts (the user states what to do, not just what they want)
- ollama-agent or similar minimal-prompt frameworks

Poor fit:
- Mixed chat + tool-calling with a 4b model (strong tool-call bias regardless of prompt)
- Large system prompts with 22+ tools (accuracy degrades, format errors increase)
- Agentic coding assistants like Claude Code (see below)
Claude Code is not recommended with any of these models, even with large context.
Use a purpose-built model (e.g. a Claude model via the Anthropic API) for Claude Code tool calling.
The 27b-ft model requires KV cache quantization to fit with usable context:
# Add to Ollama service environment:
OLLAMA_KV_CACHE_TYPE=q4_0
OLLAMA_FLASH_ATTENTION=1
On systemd-based systems:
sudo tee /etc/systemd/system/ollama.service.d/kv-cache.conf << EOF
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
Environment="OLLAMA_FLASH_ATTENTION=1"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
| Config | Model | KV Cache | Total | Fits 24GB? |
|---|---|---|---|---|
| 27b-ft, 40K ctx, q4_0 KV | 15.8 GB | ~5.7 GB | ~22 GB | Yes |
| 27b-ft, 28K ctx, q8_0 KV | 15.8 GB | ~5.7 GB | ~21.5 GB | Yes |
| 27b-ft, 28K ctx, FP16 KV | 15.8 GB | ~14 GB | ~30 GB | No (OOM) |
| 12b-ft, 65K ctx, q4_0 KV | 7.3 GB | ~8.4 GB | ~16 GB | Yes |
| 12b-ft, 65K ctx, q8_0 KV | 7.3 GB | ~12 GB | ~19 GB | Yes |
| 4b-ft, 32K ctx, q4_0 KV | 2.5 GB | ~1.7 GB | ~4.2 GB | Yes |
pip install ollama-agent
ollama-agent -m orieg/gemma3-tools:12b-ft
ollama-agent uses a lightweight system prompt with 5-8 tools (shell, file read/write, grep, memory, RAG). This is the sweet spot for these models - 12b-ft achieves 100% accuracy in this setup.
The 4b model has a strong bias toward emitting tool calls even for conversational prompts like “what is 2+2?”. This is a model size issue - the 12b and 27b variants (both base and fine-tuned) handle mixed chat + tool-calling correctly. If you need to use the 4b model in a context with conversational turns, add an explicit system prompt instruction:
Only call a tool when the user asks you to perform an action.
For questions and explanations, respond in plain text.
Starting with Ollama 0.5.x, the Go runner allocates a full context-size KV cache for all layers, including Sliding Window Attention (SWA) layers that previously only needed ~1K tokens. For Gemma 3 27b at 28K context, the FP16 KV cache alone is roughly 14 GB (see the VRAM table above).
Workaround: Use OLLAMA_KV_CACHE_TYPE=q4_0 (or q8_0) to quantize the KV cache. This reduces KV memory by 4x (or 2x), making large contexts viable again.
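The 4x/2x reductions fall directly out of bytes per element (FP16 = 2 bytes, q8_0 ≈ 1 byte, q4_0 ≈ 0.5 bytes). A back-of-envelope estimator; the dimensions below are illustrative placeholders, not Gemma 3's real configuration:

```python
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each hold ctx * n_kv_heads * head_dim elements per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Placeholder dimensions for illustration only (not Gemma 3's actual shapes).
args = (28_000, 48, 8, 128)
fp16 = kv_cache_bytes(*args, bytes_per_elem=2.0)
q8 = kv_cache_bytes(*args, bytes_per_elem=1.0)
q4 = kv_cache_bytes(*args, bytes_per_elem=0.5)
print(fp16 / q8, fp16 / q4)  # 2.0 4.0
```

Real quantized formats carry a small per-block scale overhead, so the true savings are slightly below the ideal 2x/4x.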
Ollama silently truncates prompts that exceed num_ctx. With Claude Code’s ~35K token tool prompt and a model set to num_ctx=12288, you get:
truncating input prompt limit=12288 prompt=35946 keep=5 new=12288
The tool definitions are stripped, and the model never sees them. Always ensure num_ctx exceeds your expected prompt size.
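A cheap guard is to estimate prompt tokens before sending. The chars-per-token ratio below is a rough heuristic, not the model's actual tokenizer:

```python
def fits_context(prompt, num_ctx, chars_per_token=4.0):
    """Rough pre-flight check against silent truncation (heuristic only)."""
    est_tokens = int(len(prompt) / chars_per_token)
    if est_tokens > num_ctx:
        print(f"warning: ~{est_tokens} estimated tokens exceed num_ctx={num_ctx}")
        return False
    return True

fits_context("x" * 200_000, 12288)  # a Claude-Code-sized prompt vs a 12K window
```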
At 15-22 tools, models occasionally fall back to emitting {"name": ..., "arguments": ...} inside a markdown code block instead of proper <tool_call> XML tags. Ollama does not parse these as tool_calls - they appear as plain content. Streaming helps slightly; non-streaming is more prone to this.
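If you want to salvage those fallbacks rather than drop them, a best-effort recovery sketch can work; `salvage_markdown_tool_call` is hypothetical, not an Ollama feature:

```python
import json
import re

FENCE = "`" * 3  # avoids writing a literal markdown fence inside this example

def salvage_markdown_tool_call(content):
    """Hypothetical best-effort recovery of a {"name": ..., "arguments": ...}
    payload emitted inside a markdown code fence instead of <tool_call> tags."""
    m = re.search(r"`{3}(?:json)?\s*(\{.*\})\s*`{3}", content, re.DOTALL)
    if not m:
        return None
    try:
        obj = json.loads(m.group(1))
    except json.JSONDecodeError:
        return None
    return obj if {"name", "arguments"} <= obj.keys() else None

fallback = FENCE + 'json\n{"name": "grep", "arguments": {"pattern": "TODO"}}\n' + FENCE
print(salvage_markdown_tool_call(fallback))
```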
Base models: unsloth/gemma-3-27b-it (27b) / unsloth/gemma-3-12b-it (12b) / unsloth/gemma-3-4b-it (4b). The models use the Gemma 3 chat format with tool-call support:
<start_of_turn>user
{system prompt + tools}
{user message}<end_of_turn>
<start_of_turn>model
<tool_call>
{"name": "function_name", "arguments": {"key": "value"}}
</tool_call><end_of_turn>
Tool responses are passed back as:
<start_of_turn>user
<tool_response>
{tool output}
</tool_response><end_of_turn>
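The turn structure above can be sketched as a string template. This is illustrative only; Ollama's bundled Go template is the source of truth:

```python
import json

def render_tool_turn(system_and_tools, user_msg, tool_call):
    # Mirrors the turn layout shown above; not Ollama's actual Go template.
    return (
        "<start_of_turn>user\n"
        f"{system_and_tools}\n"
        f"{user_msg}<end_of_turn>\n"
        "<start_of_turn>model\n"
        "<tool_call>\n"
        f"{json.dumps(tool_call)}\n"
        "</tool_call><end_of_turn>"
    )

turn = render_tool_turn(
    "You may call get_weather.",
    "What is the weather in Paris?",
    {"name": "get_weather", "arguments": {"location": "Paris"}},
)
print(turn)
```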
Based on Google’s Gemma 3 models. Subject to the Gemma Terms of Use.