ollama run guoxuter/ov_intent_analysis_sft:v1_q8
Local intent-analysis model for OpenViking retrieval planning.
ov_intent_analysis_sft is a lightweight Q8-quantized Ollama model designed for local deployment with OpenViking and the OpenViking OpenClaw Plugin.
Its main purpose is to decide whether a user turn actually needs context retrieval. For small talk, greetings, or turns where the required context is already covered, the model returns an empty query list, helping avoid unnecessary memory injection and reduce token usage. When retrieval is needed, it emits compact JSON queries for OpenViking context types such as skill, resource, and memory.
`v7_q8` writes declarative, embedding-friendly queries for stronger semantic retrieval.

| Tag | Recommended | Description |
|---|---|---|
| `v7_q8` | Yes | Latest. Best retrieval quality; requires the v7 SFT prompt template. |
| `v4_q8` | Compact | Smaller output schema; requires the v4 prompt template. |
| `v1_q8` | Compatible | Works with the original OpenViking intent-analysis prompt. |
Install Ollama:

```shell
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

ollama --version
```

Pull the model:

```shell
# Recommended (latest)
ollama pull guoxuter/ov_intent_analysis_sft:v7_q8

# Compact schema
ollama pull guoxuter/ov_intent_analysis_sft:v4_q8
```
Call with the Ollama API:

```shell
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "guoxuter/ov_intent_analysis_sft:v7_q8",
  "prompt": "<your rendered v7 prompt>",
  "stream": false,
  "think": false,
  "format": "json",
  "options": {
    "temperature": 0,
    "num_predict": 1024
  }
}'
```

Production note: the model was not trained with thinking mode. Set `"think": false` to avoid extra latency.
`v7_q8` returns a single JSON object:

```json
{
  "queries": [
    {
      "query": "RFC standard template",
      "context_type": "resource",
      "priority": 1
    }
  ]
}
```

If no retrieval is needed:

```json
{
  "queries": []
}
```
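The two output shapes above suggest a small defensive parser on the caller's side. The following is a minimal sketch, not part of OpenViking: the function name `parse_query_plan` and the constants are hypothetical, and the checks (three allowed context types, priority 1 to 5, at most 5 queries) are taken from the v7 prompt template rather than from any published validator.

```python
import json

ALLOWED_TYPES = {"skill", "resource", "memory"}
MAX_QUERIES = 5  # the v7 prompt caps the plan at 5 queries


def parse_query_plan(raw: str) -> list[dict]:
    """Parse the model's JSON text and keep only well-formed queries.

    Returns an empty list when no retrieval is needed or the output is unusable.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    items = data.get("queries")
    if not isinstance(items, list):
        return []

    plan = []
    for item in items:
        if not isinstance(item, dict):
            continue
        query = item.get("query")
        ctx = item.get("context_type")
        priority = item.get("priority")
        if not isinstance(query, str) or not query.strip():
            continue
        if not isinstance(ctx, str) or ctx not in ALLOWED_TYPES:
            continue
        # bool is a subclass of int, so reject it explicitly
        if isinstance(priority, bool) or not isinstance(priority, int):
            continue
        if not 1 <= priority <= 5:
            continue
        plan.append({"query": query.strip(), "context_type": ctx, "priority": priority})

    plan.sort(key=lambda q: q["priority"])  # 1 = highest priority first
    return plan[:MAX_QUERIES]
```

An empty return value can be treated as "skip retrieval for this turn", which is the case the model signals with `"queries": []`.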
| Type | Meaning | Query Style |
|---|---|---|
| `skill` | Executable capability, tool, function, API, automation | Imperative verb phrase, e.g. `Create RFC document` |
| `resource` | Knowledge artifact, document, spec, guide, code, config | Noun phrase, e.g. `RFC standard template` |
| `memory` | User preference or agent execution experience | `User's ...`, `Experience executing ...`, or `System insights about ...` |
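Since each context type is searched differently, a caller will usually split the plan by type before retrieval. The sketch below shows one way to do that; `group_by_context_type` is a hypothetical helper, and the sample plan reuses the example phrasings from the table (the `memory` query text is made up for illustration).

```python
from collections import defaultdict


def group_by_context_type(queries: list[dict]) -> dict[str, list[str]]:
    """Group query strings by context type, highest priority (1) first."""
    grouped = defaultdict(list)
    for q in sorted(queries, key=lambda q: q["priority"]):
        grouped[q["context_type"]].append(q["query"])
    return dict(grouped)


# Illustrative plan; the memory query is an invented example.
plan = [
    {"query": "RFC standard template", "context_type": "resource", "priority": 2},
    {"query": "Create RFC document", "context_type": "skill", "priority": 1},
    {"query": "User's document formatting preferences", "context_type": "memory", "priority": 3},
]

for context_type, texts in group_by_context_type(plan).items():
    print(context_type, texts)
```

Each group can then be handed to whichever retriever serves that context type.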
Prompt template for `v7_q8`:

```yaml
metadata:
  id: "retrieval.intent_analysis_v7_sft"
  name: "Intent Analysis 7 SFT"
  description: "Analyze session context to generate query plans for different context types. SFT deploy-time schema without reasoning or per-query intent."
  version: "7.0.0"
  language: "en"
  category: "retrieval"

template: |
  You are OpenViking's context query planner, responsible for analyzing task context gaps and generating queries.
  ## Session Context
  ### Session Summary
  {{ compression_summary }}
  ### Recent Conversation
  {{ recent_messages }}
  ### Current Message
  {{ current_message }}
  {% if context_type %}
  ## Search Scope Constraints
  **Restricted Context Type**: {{ context_type }}
  {% if target_abstract %}
  **Target Directory Abstract**: {{ target_abstract }}
  {% endif %}
  **Important**: You can only generate `{{ context_type }}` type queries, do not generate other types.
  {% endif %}
  ## Your Task
  Analyze the current task, identify context gaps, and generate queries to fill in the required information.
  **Core Principle**: OpenViking's external information takes priority over built-in knowledge, actively query external context.
  ## Context Types and Query Styles
  OpenViking supports the following context types, **each type has a different query style**:
  ### 1. skill (Execution Capability)
  **Purpose**: Executable tools, functions, APIs, automation scripts
  **When to Query**:
  - Task contains action verbs (create, generate, write, build, analyze, process)
  - Need to perform specific operations
  ### 2. resource (Knowledge Resources)
  **Purpose**: Documents, specifications, guides, code, configurations, and other structured knowledge
  **When to Query**:
  - Need reference materials, templates, specifications
  - Need to understand knowledge, concepts, definitions
  ### 3. memory (User/Agent Memory)
  **Purpose**: User personalization information or Agent execution experience
  **When to Query**:
  - Need personalized customization (user memory)
  - Need to learn from historical experience (agent memory)
  ## Analysis Method
  ### Step 1: Identify Task Type
  **Operational Tasks** (containing actions):
  - Characteristics: Verbs like create, generate, write, build, transform, calculate, analyze, process
  - Typical context combination: `skill + resource + memory`
  **Informational Tasks** (acquiring knowledge):
  - Characteristics: What is, how to understand, why, concept explanation, etc.
  - Typical context combination: `resource + memory`
  **Conversational Tasks** (small talk):
  - Characteristics: Greetings, small talk, confirmation of understanding, etc.
  - Usually no query needed
  ### Step 2: Check Context Coverage
  Analyze whether the session context (summary + recent conversation) already contains the information needed to complete the task:
  - **Fully covered**: Skip queries for that type
  - **Partially covered**: Generate supplementary queries
  - **Not covered**: Generate complete queries
  **Note**: Only skip information that has been **explicitly and in detail** discussed in the context.
  ### Step 3: Generate Queries
  **Important Principles**:
  1. **Don't over-transform**:
     - ❌ Don't convert "Create XX" to "XX format/specification"
  2. **Multi-type combination**:
     - A task may require multiple context types
     - Operational tasks typically need: skill (execution) + resource (reference) + memory (preference/experience)
  3. **Multiple queries per type**:
     - Can generate multiple queries for the same type
     - Maximum 5 queries
  4. **Queries should be concise and specific**:
     - Queries should be short, specific, and retrievable
     - Avoid lengthy descriptions
  5. **Priority setting**:
     - 1 = Highest priority (core requirement)
     - 3 = Medium priority (helpful)
     - 5 = Lowest priority (optional)
  6. **Query Style** (optimize for vector / semantic retrieval):
     - Queries are embedded and matched against indexed content by **semantic similarity**. Write each query so its embedding lands close to the target content — not necessarily a verbatim fragment, any phrasing that captures the same meaning works.
     - **Declarative, not interrogative**: state the information need as a noun/verb phrase rather than a question. Drop question framings ("what / who / when / how is ...").
     - **One information need per query**: each query targets one retrievable fact, relation, comparison, event, or procedure. Do not pile unrelated information needs into one query.
     - **Self-contained**: resolve pronouns and references using the session context; the retriever only sees the query string.
     - **Concept-dense and natural**: use a grammatical, well-formed phrase carrying the key entities, attributes, and qualifiers. Avoid both bare single keywords and telegraphic word-salad.
     - **No retrieval-meta words**: exclude words describing the act of retrieval or generic containers ("find", "search", "records", "information about", "content", "details", etc.) — they do not appear in the target content and only dilute the embedding.
     - **Keep discriminative specifics**: preserve names, dates, places, and domain terms from the task — they anchor the embedding to the right content.
  ## Output Format
  {
    "queries": [
      {
        "query": "Specific query text (following the style of the corresponding type)",
        "context_type": "skill|resource|memory",
        "priority": 1-5
      }
    ]
  }
  Please output JSON:

llm_config:
  temperature: 0.1
```
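The template above expects five placeholders: `compression_summary`, `recent_messages`, `current_message`, and the optional `context_type` and `target_abstract`. The sketch below collects them into one dictionary before rendering. It is an illustration under assumptions: `build_template_vars` is a hypothetical helper, and the one-line-per-turn `role: content` format for `recent_messages` is a guess, since the source does not specify how OpenViking serializes the conversation.

```python
from typing import Optional

ALLOWED_CONTEXT_TYPES = {"skill", "resource", "memory"}


def build_template_vars(
    current_message: str,
    recent_messages: list[tuple[str, str]],
    compression_summary: str = "",
    context_type: Optional[str] = None,
    target_abstract: Optional[str] = None,
) -> dict:
    """Collect the values for the placeholders used in the v7 template."""
    if context_type is not None and context_type not in ALLOWED_CONTEXT_TYPES:
        raise ValueError(f"unsupported context_type: {context_type!r}")
    # Assumption: each turn is rendered as one "role: content" line.
    history = "\n".join(f"{role}: {content}" for role, content in recent_messages)
    return {
        "compression_summary": compression_summary,
        "recent_messages": history,
        "current_message": current_message,
        # None keeps the `{% if context_type %}` block out of the prompt.
        "context_type": context_type,
        "target_abstract": target_abstract,
    }
```

If the `jinja2` package is installed, the dictionary can be passed to `jinja2.Template(template_text).render(**template_vars)`, where `template_text` is the content of the `template` field.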
Call from Python:

```python
import json

import requests

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "guoxuter/ov_intent_analysis_sft:v7_q8"

payload = {
    "model": MODEL,
    "prompt": "<your rendered v7 prompt>",
    "stream": False,   # return one complete response instead of a token stream
    "think": False,    # the model was not trained with thinking mode
    "format": "json",  # constrain the output to valid JSON
    "options": {
        "temperature": 0,
        "num_predict": 1024,
    },
}

response = requests.post(OLLAMA_URL, json=payload, timeout=120)
response.raise_for_status()

# The generated text is a JSON string inside the "response" field.
body = response.json()
result = json.loads(body["response"])
print(json.dumps(result, ensure_ascii=False, indent=2))
```
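In an agent loop, a failed planner call should probably not block the user turn. The sketch below wraps the same request in a function that degrades to an empty query list on any error. It is one possible design, not OpenViking's own client: `post_json` and `plan_queries` are hypothetical names, the transport uses only the standard library, and it is injectable so the function can be exercised without a running Ollama server.

```python
import json
import urllib.request
from typing import Callable

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "guoxuter/ov_intent_analysis_sft:v7_q8"


def post_json(url: str, payload: dict, timeout: float = 120.0) -> dict:
    """Default transport: POST a JSON body and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))


def plan_queries(prompt: str, transport: Callable[[str, dict], dict] = post_json) -> list:
    """Return the model's query list, or [] if the call or the parsing fails."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "think": False,
        "format": "json",
        "options": {"temperature": 0, "num_predict": 1024},
    }
    try:
        body = transport(OLLAMA_URL, payload)
        result = json.loads(body["response"])
        queries = result["queries"]
    except (OSError, KeyError, TypeError, ValueError):
        # Network errors, missing fields, and malformed JSON
        # all degrade to "no retrieval".
        return []
    return queries if isinstance(queries, list) else []
```

Falling back to an empty list trades recall for availability. A caller that prefers recall could instead fall back to searching with the raw user message.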
Benchmark environment: MacBook Pro, Apple M2 Pro, 12-core CPU (8 performance + 4 efficiency), 19-core GPU, 32 GB memory.
| Model / Method | Locomo Accuracy | ChitChat F1 | GPU Time (s) | CPU Time (s) | Quantization |
|---|---|---|---|---|---|
| doubao-seed-2.0-pro | 0.9032 | 0.9176 | - | - | None |
| qwen3.5-0.8b base | - | 0.1556 | 7.78 | 12.74 | 8 bit |
| `v1_q8` | 0.8955 | 0.9070 | 6.95 | 12.13 | 8 bit |
| `v4_q8` | 0.8890 | 0.9176 | 2.86 | 5.57 | 8 bit |
| `v7_q8` | 0.9037 | 0.9176 | 2.80 | 5.80 | 8 bit |
`v7_q8` Locomo accuracy is the mean of 3 runs (0.9039 / 0.9045 / 0.9026; run-to-run spread of about ±0.1 pp), evaluated end-to-end inside OpenViking (intent → `search` retrieval → GPT-5.4 answer → LLM judge). ChitChat F1 is measured on the WOT chitchat-vs-task benchmark. GPU Time and CPU Time are mean per-request latencies in seconds.
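As a quick check of the arithmetic, the reported mean and spread can be recomputed from the three run values quoted in the note. The sketch reads the ±0.1 pp figure as half of the min-max range, which is an interpretation and not something the source states.

```python
from statistics import mean

# v7_q8 Locomo accuracy per run, as quoted in the benchmark note
runs = [0.9039, 0.9045, 0.9026]

avg = mean(runs)
# Half of the min-max range, converted to percentage points
half_range_pp = (max(runs) - min(runs)) / 2 * 100

print(f"mean = {avg:.4f}")
print(f"half-range = {half_range_pp:.3f} pp")
```

The mean rounds to 0.9037, matching the table, and the half-range comes out just under 0.1 pp.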
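Recommendations:

- Use `v7_q8` for new integrations: best retrieval quality, with latency on par with `v4_q8`.
- Use the prompt template that matches the tag: v7 prompt for `v7_q8`, v4 prompt for `v4_q8`, original prompt for `v1_q8`.
- Set `temperature` to `0` (as in the request examples) for deterministic JSON output; the prompt template's own `llm_config` default is 0.1.
- Set `format` to `"json"` to reduce parsing failures.
- Keep `"think": false` in production.
- Adjust `num_predict` if the rendered prompt is long.
- Replace `find` calls with intent-aware search planning.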