Vision-Language Models (VLMs) for computer-use agents: UI grounding, web/desktop automation, mobile automation, and business workflows. Built by H Company on the Qwen 3.5 family, packaged here as GGUF for Ollama with the CLIP vision projector bundled in.
Given a screenshot and an instruction, Holo locates the correct UI element and returns either an action (e.g. a click at normalized [0, 1000] coordinates) or a textual answer.
ollama run ahmadwaqar/holo-3.1
All variants live under one repo and are selected by tag.
| Tag | Size | Variant | Quant | Notes |
|---|---|---|---|---|
| `latest`, `4b` | 4B | instruct | Q8_0 | Default. Best general accuracy / quality. |
| `4b-q4` | 4B | instruct | Q4_K_M | Smaller, faster, slightly lower quality. |
| `4b-thinking` | 4B | thinking | Q8_0 | Emits a `<think>` reasoning plan before acting. |
| `4b-thinking-q4` | 4B | thinking | Q4_K_M | Thinking, smaller footprint. |
| `0.8b` | 0.8B | instruct | Q8_0 | Ultra-light; fast one-shot grounding on modest hardware. |
| `0.8b-q4` | 0.8B | instruct | Q4_K_M | Smallest build. |
ollama pull ahmadwaqar/holo-3.1:4b
ollama pull ahmadwaqar/holo-3.1:0.8b
ollama pull ahmadwaqar/holo-3.1:4b-thinking
- `4b` (or `0.8b` on lighter machines). These answer directly into `content`; temperature defaults to 0.0 for deterministic coordinates.
- `4b-thinking`. It produces a `<think>` plan first; defaults to temp 0.6 / top_p 0.95 / top_k 20.
- `-q4` tags (Q4_K_M).

Holo is multimodal: send the screenshot and the instruction in the same user turn. Coordinates returned are integers in a normalized [0, 1000] space (origin top-left); scale them to real pixels with `px = x / 1000 * image_width`, `py = y / 1000 * image_height`.
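The scaling rule above can be sketched as a small helper. This is a minimal illustration, not part of the model package: the function name `to_pixels`, the rounding to the nearest pixel, and the clamping to the last valid pixel index are assumptions added here.

```python
def to_pixels(x_norm: int, y_norm: int, image_width: int, image_height: int) -> tuple[int, int]:
    """Map a point from the normalized [0, 1000] space to pixel coordinates."""
    if not (0 <= x_norm <= 1000 and 0 <= y_norm <= 1000):
        raise ValueError("normalized coordinates must lie in [0, 1000]")
    # px = x / 1000 * image_width, py = y / 1000 * image_height (from the text above).
    px = round(x_norm / 1000 * image_width)
    py = round(y_norm / 1000 * image_height)
    # Assumption: clamp so that a coordinate of exactly 1000 still maps to a valid pixel index.
    return min(px, image_width - 1), min(py, image_height - 1)


# Example: the center-left quarter point of a 1920x1080 screenshot.
point = to_pixels(500, 250, 1920, 1080)
```

Clamping matters only at the right and bottom edges; drop it if your automation layer accepts coordinates equal to the image width or height.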
You are Holo, a GUI grounding agent for computer-use automation. Given a
screenshot and a task, locate the correct UI element and call the appropriate
tool. Click coordinates must be integers in the [0, 1000] space, normalized to
the provided image with the origin at the top-left corner.
curl http://localhost:11434/api/chat -d '{
  "model": "ahmadwaqar/holo-3.1:4b",
  "stream": false,
  "messages": [
    { "role": "system", "content": "You are Holo, a GUI grounding agent..." },
    {
      "role": "user",
      "content": "Click the search box.",
      "images": ["<base64-encoded screenshot>"]
    }
  ]
}'
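The `<base64-encoded screenshot>` placeholder in the request above has to be filled with the image bytes encoded as base64. A minimal sketch using only the Python standard library follows; the helper names `build_chat_payload` and `send_chat` are hypothetical, and the endpoint is the default local address used in the curl example.

```python
import base64
import json
import urllib.request


def build_chat_payload(image_bytes: bytes, instruction: str, system_prompt: str,
                       model: str = "ahmadwaqar/holo-3.1:4b") -> dict:
    """Build the /api/chat request body with the screenshot inlined as base64."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": system_prompt},
            # Screenshot and instruction travel in the same user turn.
            {"role": "user", "content": instruction, "images": [encoded]},
        ],
    }


def send_chat(payload: dict, url: str = "http://localhost:11434/api/chat") -> dict:
    """POST the payload to a locally running Ollama server and return the parsed JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Typical use would be to read the screenshot with `open("screenshot.png", "rb").read()`, pass the bytes to `build_chat_payload`, and hand the result to `send_chat`.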
import ollama

resp = ollama.chat(
    model="ahmadwaqar/holo-3.1:4b",
    messages=[
        {"role": "system", "content": "You are Holo, a GUI grounding agent..."},
        {"role": "user", "content": "Click the search box.", "images": ["screenshot.png"]},
    ],
)
print(resp["message"]["content"])
| | instruct tags | thinking tags |
|---|---|---|
| temperature | 0.0 | 0.6 |
| top_p | 1.0 | 0.95 |
| top_k | - | 20 |
| num_ctx | 8192 | 16384 |
Override per request as needed (e.g. raise num_ctx for long agent histories).
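A per-request override can be sketched by adding an `options` object to the `/api/chat` request body. The helper name `chat_request` and the value 32768 are illustrative assumptions; to my understanding, keys left out of `options` keep the defaults baked into the tag.

```python
def chat_request(model: str, messages: list[dict], **options) -> dict:
    """Build an /api/chat body, attaching sampling or context overrides if any are given."""
    body = {"model": model, "messages": messages, "stream": False}
    if options:
        # Only the listed keys are sent; everything else is left to the tag's defaults.
        body["options"] = dict(options)
    return body


# Example: raise the context window for a long agent history (value is illustrative).
body = chat_request(
    "ahmadwaqar/holo-3.1:4b-thinking",
    [{"role": "user", "content": "Continue the task."}],
    num_ctx=32768,
)
```

With the `ollama` Python package, the same overrides can be passed as `ollama.chat(..., options={"num_ctx": 32768})`.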
- Tool calls are emitted as XML (`<tool_call><function=name><parameter=...>`). The bundled chat template feeds tools and tool history correctly, but Ollama’s parser will not reliably surface that XML as structured `tool_calls` in the API response, so parse the XML from the assistant `content` on the client side. Plain chat and vision grounding work normally.
- The Ollama repository is named `holo-3.1` even though the model is styled “Holo-3.1”.

Apache 2.0. Original models and research by H Company: the Holo3.1 family (0.8B / 4B / 9B / 35B-A3B), based on Qwen 3.5. These tags are GGUF conversions (instruct + thinking variants) of Holo3.1-0.8B and Holo3.1-4B.
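Client-side parsing of the XML tool calls mentioned above could look roughly like the sketch below. It assumes the tags are closed with `</parameter>`, `</function>`, and `</tool_call>`, which this page does not show; the function name `click` and the parameters `x` and `y` in the example are hypothetical as well. Check real model output before relying on it.

```python
import re

# Assumed layout: <tool_call><function=NAME><parameter=KEY>VALUE</parameter>...</function></tool_call>
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>", re.DOTALL
)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(content: str) -> list[dict]:
    """Extract tool calls from assistant content; parameter values are returned as strings."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(content):
        arguments = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


# Hypothetical assistant output used only to exercise the parser.
example = (
    "<tool_call>\n<function=click>\n"
    "<parameter=x>\n512\n</parameter>\n"
    "<parameter=y>\n300\n</parameter>\n"
    "</function>\n</tool_call>"
)
calls = parse_tool_calls(example)
```

Values come back as strings, so convert coordinates with `int(...)` before scaling them to pixels.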
@misc{hai2026holo31,
title={Holo3.1: Fast & Local Computer Use Agents},
author={H Company},
year={2026},
url={https://huggingface.co/Hcompany/Holo3.1-35B-A3B}
}