28 Downloads Updated 3 weeks ago
ollama run guoxuter/ov_intent_analysis_sft:v4_q8
bab5186d07d8 · 812MB ·
Local intent-analysis model for OpenViking retrieval planning.
ov_intent_analysis_sft is a lightweight Q8-quantized Ollama model designed for local deployment with OpenViking and the OpenViking OpenClaw Plugin.
Its main purpose is to decide whether a user turn actually needs context retrieval. For small talk, greetings, or turns where the required context is already covered, the model returns an empty query list, helping avoid unnecessary memory injection and reduce token usage. When retrieval is needed, it emits compact JSON queries for OpenViking context types such as skill, resource, and memory.
`v4_q8` provides lower latency with a smaller output schema.

| Tag | Recommended | Description |
|---|---|---|
| `v4_q8` | Yes | Compact output, lower latency, requires the v4 prompt template. |
| `v1_q8` | Compatible | Works with the original OpenViking intent-analysis prompt. |

Install Ollama:
```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

ollama --version
```
Pull the model:
```bash
# Recommended
ollama pull guoxuter/ov_intent_analysis_sft:v4_q8

# Compatible legacy version
ollama pull guoxuter/ov_intent_analysis_sft:v1_q8
```
Call with the Ollama API:
```bash
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "guoxuter/ov_intent_analysis_sft:v4_q8",
  "prompt": "<your rendered v4 prompt>",
  "stream": false,
  "think": false,
  "format": "json",
  "options": {
    "temperature": 0,
    "num_predict": 1024
  }
}'
```
Production note: the model was not trained with thinking mode. Set `"think": false` to avoid extra latency.
v4_q8 returns a single JSON object:
```json
{
  "queries": [
    {
      "query": "RFC standard template",
      "context_type": "resource",
      "priority": 1
    }
  ]
}
```
If no retrieval is needed:
```json
{
  "queries": []
}
```
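
Because downstream code consumes this object directly, it can help to validate it before dispatching any queries. The following is a minimal sketch, not part of OpenViking: `validate_intent_output` is a hypothetical helper name, and the checks mirror only the fields and ranges described on this page (`query`, `context_type` as one of `skill`, `resource`, `memory`, and `priority` from 1 to 5).

```python
import json

ALLOWED_TYPES = {"skill", "resource", "memory"}


def validate_intent_output(raw: str) -> list[dict]:
    """Parse the model's JSON text and return its queries, or raise ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or not isinstance(data.get("queries"), list):
        raise ValueError("expected an object with a 'queries' array")
    for item in data["queries"]:
        if not isinstance(item, dict):
            raise ValueError("each query must be an object")
        query = item.get("query")
        if not isinstance(query, str) or not query.strip():
            raise ValueError("'query' must be a non-empty string")
        ctype = item.get("context_type")
        if not isinstance(ctype, str) or ctype not in ALLOWED_TYPES:
            raise ValueError(f"unexpected context_type: {ctype!r}")
        priority = item.get("priority")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(priority, bool) or not isinstance(priority, int):
            raise ValueError("'priority' must be an integer")
        if not 1 <= priority <= 5:
            raise ValueError("'priority' must be between 1 and 5")
    return data["queries"]
```

An empty list returned here corresponds to the "no retrieval needed" case, so callers can skip context injection when the result is falsy.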
| Type | Meaning | Query Style |
|---|---|---|
| `skill` | Executable capability, tool, function, API, automation | Imperative verb phrase, e.g. `Create RFC document` |
| `resource` | Knowledge artifact, document, spec, guide, code, config | Noun phrase, e.g. `RFC standard template` |
| `memory` | User preference or agent execution experience | `User's ...`, `Experience executing ...`, or `System insights about ...` |

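
One way to consume these types is to group the emitted queries by `context_type` and order them by `priority` before handing each group to its retriever. The sketch below is illustrative only: `group_queries` is a hypothetical helper, the sample queries and priorities are made up to match the query styles in the table, and treating a lower `priority` number as more urgent is an assumption that this page does not state.

```python
from collections import defaultdict


def group_queries(queries: list[dict]) -> dict[str, list[str]]:
    """Group query strings by context type, ordered by ascending priority."""
    grouped = defaultdict(list)
    # sorted() is stable, so queries with equal priority keep their emitted order.
    for item in sorted(queries, key=lambda q: q["priority"]):
        grouped[item["context_type"]].append(item["query"])
    return dict(grouped)


# Illustrative input following the query styles in the table above.
example = [
    {"query": "RFC standard template", "context_type": "resource", "priority": 2},
    {"query": "Create RFC document", "context_type": "skill", "priority": 1},
    {"query": "User's document formatting preferences", "context_type": "memory", "priority": 3},
]
print(group_queries(example))
```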
```yaml
metadata:
  id: "retrieval.intent_analysis"
  name: "Intent Analysis v4"
  description: "v4 prompt for compact intent-analysis models that emit only queries."
  version: "4.0.0"
  language: "en"
  category: "retrieval"

template: |
  You are OpenViking's context query planner. Given the session context and the current message, decide what context information is missing and emit retrieval queries to fill the gap.
  ## Session Context
  ### Session Summary
  {{ compression_summary }}
  ### Recent Conversation
  {{ recent_messages }}
  ### Current Message
  {{ current_message }}
  {% if context_type %}
  ## Search Scope Constraints
  **Restricted Context Type**: {{ context_type }}
  {% if target_abstract %}
  **Target Directory Abstract**: {{ target_abstract }}
  {% endif %}
  Only emit `{{ context_type }}` queries; do not generate other types.
  {% endif %}
  External information takes priority over built-in knowledge - actively query for any missing context.
  ## Procedure
  1. Classify the task - operational tasks typically need skill+resource+memory; informational tasks typically need resource+memory; conversational small talk needs no query.
  2. Skip any context type already covered explicitly in the conversation.
  3. For each needed type, emit 1-5 concise retrievable queries with `priority` from 1 to 5.
  ## Output Format
  Output a single JSON object with exactly one top-level key:
  - `queries`: array of objects with:
    - `query`: actual query text
    - `context_type`: one of `skill`, `resource`, `memory`
    - `priority`: integer from 1 to 5
  If no query is needed, set `queries` to an empty array `[]`.
  Output the JSON object directly. Do not wrap it in markdown code fences.

llm_config:
  temperature: 0.0
```
```python
import json

import requests

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "guoxuter/ov_intent_analysis_sft:v4_q8"

payload = {
    "model": MODEL,
    "prompt": "<your rendered v4 prompt>",
    "stream": False,
    "think": False,  # the model was not trained with thinking mode
    "format": "json",
    "options": {
        "temperature": 0,
        "num_predict": 1024,
    },
}

response = requests.post(OLLAMA_URL, json=payload, timeout=120)
response.raise_for_status()

# The generated text arrives as a string in the "response" field; parse it as JSON.
body = response.json()
result = json.loads(body["response"])
print(json.dumps(result, ensure_ascii=False, indent=2))
```
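
In a retrieval pipeline, a malformed model response probably should not crash the whole turn. As a hedged sketch, the hypothetical helper `safe_queries` below takes the decoded Ollama response body (the `body` dict from the example above) and falls back to an empty query list when the `response` field is missing or is not valid JSON. Treating bad output as "no retrieval needed" is an assumption; some deployments may prefer to retry or log the failure instead.

```python
import json


def safe_queries(body: dict) -> list:
    """Return the 'queries' list from an /api/generate body, or [] on bad output."""
    try:
        result = json.loads(body.get("response", ""))
    except (json.JSONDecodeError, TypeError):
        # Empty, truncated, or non-string output: fall back to no retrieval.
        return []
    if not isinstance(result, dict):
        return []
    queries = result.get("queries", [])
    return queries if isinstance(queries, list) else []
```

A caller would then replace `json.loads(body["response"])` with `safe_queries(body)` and skip context injection whenever the returned list is empty.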
Benchmark environment: MacBook Pro, Apple M2 Pro, 12-core CPU (8 performance + 4 efficiency), 19-core GPU, 32 GB memory.
| Model / Method | Locomo Accuracy | ChitChat F1 | GPU Time | CPU Time | Quantization |
|---|---|---|---|---|---|
| doubao-seed-2.0-pro | 0.9032 | 0.9176 | - | - | None |
| qwen3.5-0.8b base | - | 0.1556 | 7.78 | 12.74 | 8 bit |
| `v1_q8` | 0.8955 | 0.9070 | 6.95 | 12.13 | 8 bit |
| `v4_q8` | 0.8890 | 0.9176 | 2.86 | 5.57 | 8 bit |

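
The latency difference can be read off the table as a ratio. The snippet below is plain arithmetic on the reported GPU and CPU times; the table does not state their unit, so only unitless ratios are computed, and the variable and function names are illustrative.

```python
# Times copied from the benchmark table above (unit not stated in the source).
times = {
    "qwen3.5-0.8b base": {"gpu": 7.78, "cpu": 12.74},
    "v1_q8": {"gpu": 6.95, "cpu": 12.13},
    "v4_q8": {"gpu": 2.86, "cpu": 5.57},
}


def speedup(baseline: str, candidate: str, device: str) -> float:
    """Return how many times faster `candidate` is than `baseline` on `device`."""
    return times[baseline][device] / times[candidate][device]


for device in ("gpu", "cpu"):
    ratio = speedup("v1_q8", "v4_q8", device)
    print(f"v4_q8 vs v1_q8 on {device}: {ratio:.2f}x")
# prints 2.43x for gpu and 2.18x for cpu
```

On these figures `v4_q8` is roughly 2.4 times faster than `v1_q8` on GPU and about 2.2 times faster on CPU, while Locomo accuracy drops by 0.0065 (0.8955 to 0.8890) and ChitChat F1 rises from 0.9070 to 0.9176.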
- Use `v4_q8` for new integrations.
- Use the v4 prompt template with `v4_q8`; use the original prompt only for `v1_q8`.
- Set `temperature` to 0 for deterministic JSON output.
- Set `format` to `"json"` to reduce parsing failures.
- Keep `"think": false` in production.
- Adjust `num_predict` if the rendered prompt is long.
- `find` calls with intent-aware search planning.