Holo-3.1 vision-language computer-use agents by H Company. Locate UI elements and drive web, desktop & mobile automation from a screenshot — returns clicks in normalized [0,1000] coords. 0.8B & 4B, instruct & thinking variants, Q4_K_M/Q8_0. Apache 2.0.

Details

Updated 1 week ago · 5b2929b2f34f · 1.3GB

model	arch qwen35 · parameters 1.01B · quantization Q8_0	1.1GB
projector	arch clip · parameters 101M · quantization BF16	207MB
params	{ "num_ctx": 8192, "stop": [ "<|im_end|>", "<|endoftext|>" ], "tempe	101B
template	{{- if or .System .Tools }}<|im_start|>system {{- if .Tools }} # Tools You have access to the follow	1.7kB

Holo-3.1 — Fast & Local Computer-Use Agents

Vision-Language Models (VLMs) for computer-use agents: UI grounding, web/desktop automation, mobile automation, and business workflows. Built by H Company on the Qwen 3.5 family, packaged here as GGUF for Ollama with the CLIP vision projector bundled in.

Given a screenshot and an instruction, Holo locates the target UI element and returns either an action (e.g. a click at normalized [0, 1000] coordinates) or a textual answer.

ollama run ahmadwaqar/holo-3.1

Available tags

All variants live under one repo and are selected by tag.

Tag	Size	Variant	Quant	Notes
`latest`, `4b`	4B	instruct	Q8_0	Default. Best general accuracy / quality.
`4b-q4`	4B	instruct	Q4_K_M	Smaller, faster, slightly lower quality.
`4b-thinking`	4B	thinking	Q8_0	Emits a `<think>` reasoning plan before acting.
`4b-thinking-q4`	4B	thinking	Q4_K_M	Thinking, smaller footprint.
`0.8b`	0.8B	instruct	Q8_0	Ultra-light; fast one-shot grounding on modest hardware.
`0.8b-q4`	0.8B	instruct	Q4_K_M	Smallest build.

ollama pull ahmadwaqar/holo-3.1:4b
ollama pull ahmadwaqar/holo-3.1:0.8b
ollama pull ahmadwaqar/holo-3.1:4b-thinking

Which one should I use?

Single-shot UI grounding / element localization → 4b (or 0.8b on lighter machines). These variants answer directly in the message content; temperature defaults to 0.0 so that coordinates are deterministic.
Multi-step agent loops / planning before acting → 4b-thinking. It emits a <think> plan before the action; defaults are temperature 0.6, top_p 0.95, top_k 20.
Tight memory / CPU → the -q4 tags (Q4_K_M).
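If you use the 4b-thinking tag, you will usually want to separate the plan from the action before acting on it. The following is a minimal sketch that assumes the plan arrives inline in the message content, wrapped in <think>...</think>; depending on your Ollama version the reasoning may instead be delivered in a separate field, in which case this helper is unnecessary. The function name `split_thinking` and the sample action text are hypothetical.

```python
import re

# Assumption: the reasoning plan is wrapped in a single <think>...</think> pair.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content: str) -> tuple[str, str]:
    """Return (plan, answer); plan is empty when no <think> block is present."""
    match = THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    plan = match.group(1).strip()
    # Everything outside the <think> block is treated as the answer.
    answer = (content[:match.start()] + content[match.end():]).strip()
    return plan, answer


# Hypothetical model output, for illustration only.
plan, answer = split_thinking("<think>Find the search box.</think>\nclick(500, 40)")
print(plan)
print(answer)
```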

Usage

Holo is multimodal: send the screenshot and the instruction in the same user turn. Coordinates returned are integers in a normalized [0, 1000] space (origin top-left); scale them to real pixels with px = x / 1000 * image_width, py = y / 1000 * image_height.
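The scaling formula can be sketched in Python as follows. The function name `to_pixels` is illustrative; the rounding and the clamp to the last pixel are assumptions of this sketch, not documented Holo behavior.

```python
def to_pixels(x_norm: int, y_norm: int, image_width: int, image_height: int) -> tuple[int, int]:
    """Map a point from the normalized [0, 1000] space to pixel coordinates."""
    if not (0 <= x_norm <= 1000 and 0 <= y_norm <= 1000):
        raise ValueError("normalized coordinates must lie in [0, 1000]")
    px = round(x_norm / 1000 * image_width)
    py = round(y_norm / 1000 * image_height)
    # Assumption: clamp so that a normalized value of 1000 stays inside the image.
    return min(px, image_width - 1), min(py, image_height - 1)


print(to_pixels(500, 250, 1920, 1080))  # (960, 270)
```

Use the dimensions of the screenshot that was actually sent to the model; if the image was resized before the request, scale again to the real screen size before clicking.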

Recommended system prompt

You are Holo, a GUI grounding agent for computer-use automation. Given a
screenshot and a task, locate the correct UI element and call the appropriate
tool. Click coordinates must be integers in the [0, 1000] space, normalized to
the provided image with the origin at the top-left corner.

API example (chat with an image)

curl http://localhost:11434/api/chat -d '{
  "model": "ahmadwaqar/holo-3.1:4b",
  "stream": false,
  "messages": [
    { "role": "system", "content": "You are Holo, a GUI grounding agent..." },
    {
      "role": "user",
      "content": "Click the search box.",
      "images": ["<base64-encoded screenshot>"]
    }
  ]
}'

Python (ollama package)

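import ollama

# Send the system prompt, the instruction, and the screenshot in one request.
resp = ollama.chat(
    model="ahmadwaqar/holo-3.1:4b",
    messages=[
        {"role": "system", "content": "You are Holo, a GUI grounding agent..."},
        # The screenshot (given here as a file path) shares the user turn with the instruction.
        {"role": "user", "content": "Click the search box.", "images": ["screenshot.png"]},
    ],
)
# The action or textual answer is in the assistant message content.
print(resp["message"]["content"])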

Defaults

Parameter	instruct tags	thinking tags
temperature	`0.0`	`0.6`
top_p	`1.0`	`0.95`
top_k	—	`20`
num_ctx	`8192`	`16384`

Override per request as needed (e.g. raise num_ctx for long agent histories).
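Per-request overrides go in the "options" object of the Ollama request body. The sketch below merges overrides into an existing chat payload; the helper name `with_options` and the value 32768 are illustrative assumptions, not recommendations from H Company.

```python
def with_options(payload: dict, **overrides) -> dict:
    """Return a copy of a chat payload with sampling/context overrides merged in."""
    merged = dict(payload)
    # Keep any options already present and let the new overrides win.
    merged["options"] = {**payload.get("options", {}), **overrides}
    return merged


base = {"model": "ahmadwaqar/holo-3.1:4b-thinking", "stream": False, "messages": []}
long_history = with_options(base, num_ctx=32768)  # 32768 is an illustrative value
print(long_history["options"])
```

With the ollama Python package, the same dictionary can be passed as the `options` argument of `ollama.chat`. Only the keys you set are overridden; the rest keep the tag defaults listed in the table.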

Notes

Tool calling: Holo was trained with a custom XML function-call format (<tool_call><function=name><parameter=...>). The bundled chat template feeds tools and tool history correctly, but Ollama’s parser will not reliably surface that XML as structured tool_calls in the API response — parse the XML from the assistant content on the client side. Plain chat and vision grounding work normally.
Name casing: Ollama lowercases all repo names, so the pull name is holo-3.1 even though the model is styled “Holo-3.1”.
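For the client-side tool-call parsing mentioned in the first note, a regex-based sketch follows. It assumes each call is laid out as <tool_call><function=NAME><parameter=KEY>VALUE</parameter></function></tool_call>; the closing tags are an assumption, since the note only shows the opening ones. The function name `click` and the parameters `x` and `y` in the sample are hypothetical, so check them against real model output before relying on this.

```python
import re

# Assumption: every opening tag has a matching closing tag, as in the sample below.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(content: str) -> list[dict]:
    """Extract tool calls from assistant text; parameter values stay as strings."""
    calls = []
    for block in TOOL_CALL_RE.findall(content):
        for name, body in FUNCTION_RE.findall(block):
            arguments = {key: value.strip() for key, value in PARAMETER_RE.findall(body)}
            calls.append({"name": name, "arguments": arguments})
    return calls


# Hypothetical assistant output, for illustration only.
sample = (
    "<tool_call>\n<function=click>\n"
    "<parameter=x>\n512\n</parameter>\n"
    "<parameter=y>\n300\n</parameter>\n"
    "</function>\n</tool_call>"
)
print(parse_tool_calls(sample))
```

Values are returned as strings on purpose; convert coordinates with `int()` only after confirming which parameters the model emits.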

License & attribution

Apache 2.0. Original models and research by H Company — Holo3.1 family (0.8B / 4B / 9B / 35B-A3B), based on Qwen 3.5. These tags are GGUF conversions (instruct + thinking variants) of Holo3.1-0.8B and Holo3.1-4B.

@misc{hai2026holo31,
  title={Holo3.1: Fast \& Local Computer Use Agents},
  author={H Company},
  year={2026},
  url={https://huggingface.co/Hcompany/Holo3.1-35B-A3B}
}