A lightweight local intent-analysis model built for the OpenViking search API. It detects whether a conversation turn needs context retrieval, skips chitchat to reduce unnecessary memory injection and token usage, and emits structured retrieval queries.

Details

Updated 2 months ago · bab5186d07d8 · 812MB

model (812MB): arch qwen35 · parameters 752M · quantization Q8_0

params (85B): { "stop": [ "<|im_end|>", "<|im_start|>", "<|endoftext|>" ] }

template (131B): {{ if .System }}<|im_start|>system {{ .System }}<|im_end|> {{ end }}<|im_start|>user {{ .Prompt }}<|

ov_intent_analysis_sft

Local intent-analysis model for OpenViking retrieval planning.

ov_intent_analysis_sft is a lightweight Q8-quantized Ollama model designed for local deployment with OpenViking and the OpenViking OpenClaw Plugin.

Its main purpose is to decide whether a user turn actually needs context retrieval. For small talk, greetings, or turns where the required context is already covered, the model returns an empty query list, helping avoid unnecessary memory injection and reduce token usage. When retrieval is needed, it emits compact JSON queries for OpenViking context types such as skill, resource, and memory.

Highlights

Retrieval refusal: avoids unnecessary retrieval for chitchat or fully-covered context.
Context-aware query rewriting: completes under-specified user queries using recent conversation context.
Compact JSON output: emits structured retrieval queries for skill, resource, and memory.
Local deployment: runs through Ollama in local CPU environments with low integration cost.
Retrieval-optimized v7 prompt: v7_q8 writes declarative, embedding-friendly queries for stronger semantic retrieval.

Available Tags

Tag	Recommended	Description
`v7_q8`	Yes	Latest. Best retrieval quality; requires the v7 SFT prompt template.
`v4_q8`	Compact	Smaller output schema; requires the v4 prompt template.
`v1_q8`	Compatible	Works with the original OpenViking intent-analysis prompt.

Quick Start

Install Ollama:

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Verify the installation
ollama --version

Pull the model:

# Recommended (latest)
ollama pull guoxuter/ov_intent_analysis_sft:v7_q8

# Compact schema
ollama pull guoxuter/ov_intent_analysis_sft:v4_q8

Call with the Ollama API:

curl http://127.0.0.1:11434/api/generate -d '{
  "model": "guoxuter/ov_intent_analysis_sft:v7_q8",
  "prompt": "<your rendered v7 prompt>",
  "stream": false,
  "think": false,
  "format": "json",
  "options": {
    "temperature": 0,
    "num_predict": 1024
  }
}'

Production note: the model was not trained with thinking mode. Set "think": false to avoid extra latency.

Output Format

v7_q8 returns a single JSON object:

{
  "queries": [
    {
      "query": "RFC standard template",
      "context_type": "resource",
      "priority": 1
    }
  ]
}

If no retrieval is needed:

{
  "queries": []
}
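The two output shapes above can be handled with a small parsing step. The following is a minimal sketch, not part of OpenViking: `parse_plan` and `MAX_QUERIES` are hypothetical names, the limits (at most 5 queries, priority 1 to 5) are taken from the v7 prompt template, and treating unparseable output as "no retrieval" is an assumption you may want to change.

```python
import json

MAX_QUERIES = 5  # the v7 prompt template asks for at most 5 queries


def parse_plan(raw: str) -> list[dict]:
    """Parse the model's JSON text into query dicts sorted by priority.

    Returns an empty list when no retrieval is needed. Assumption: output
    that cannot be parsed is also treated as "no retrieval".
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    queries = data.get("queries") if isinstance(data, dict) else None
    if not isinstance(queries, list):
        return []
    valid = [
        q for q in queries
        if isinstance(q, dict)
        and isinstance(q.get("query"), str) and q["query"].strip()
        and isinstance(q.get("priority"), int) and 1 <= q["priority"] <= 5
    ]
    # Priority 1 is the highest, so an ascending sort puts core needs first.
    return sorted(valid, key=lambda q: q["priority"])[:MAX_QUERIES]


raw_output = (
    '{"queries": [{"query": "RFC standard template", '
    '"context_type": "resource", "priority": 1}]}'
)
plan = parse_plan(raw_output)
needs_retrieval = bool(plan)  # an empty plan means the turn can skip retrieval
print(needs_retrieval, plan)
```

An empty list from `parse_plan` is the signal to skip retrieval for that turn.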

Context Types

Type	Meaning	Query Style
`skill`	Executable capability, tool, function, API, automation	Imperative verb phrase, e.g. `Create RFC document`
`resource`	Knowledge artifact, document, spec, guide, code, config	Noun phrase, e.g. `RFC standard template`
`memory`	User preference or agent execution experience	`User's ...`, `Experience executing ...`, or `System insights about ...`
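Because each query carries one of the three context types above, a caller will usually want to route queries per type. The sketch below shows one way to group them; `group_by_context_type` is a hypothetical helper, and how each group is then passed to OpenViking search is deliberately left out because it depends on your integration.

```python
from collections import defaultdict

CONTEXT_TYPES = ("skill", "resource", "memory")


def group_by_context_type(queries: list[dict]) -> dict[str, list[str]]:
    """Group query strings by context type, dropping unknown types."""
    grouped = defaultdict(list)
    for q in queries:
        ctype = q.get("context_type")
        if ctype in CONTEXT_TYPES:
            grouped[ctype].append(q["query"])
    return dict(grouped)


example = [
    {"query": "Create RFC document", "context_type": "skill", "priority": 1},
    {"query": "RFC standard template", "context_type": "resource", "priority": 1},
    {"query": "RFC review checklist", "context_type": "resource", "priority": 3},
]
grouped = group_by_context_type(example)
print(grouped)
```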

Recommended v7 Prompt Template

metadata:
  id: "retrieval.intent_analysis_v7_sft"
  name: "Intent Analysis 7 SFT"
  description: "Analyze session context to generate query plans for different context types. SFT deploy-time schema without reasoning or per-query intent."
  version: "7.0.0"
  language: "en"
  category: "retrieval"

template: |
  You are OpenViking's context query planner, responsible for analyzing task context gaps and generating queries.

  ## Session Context

  ### Session Summary
  {{ compression_summary }}

  ### Recent Conversation
  {{ recent_messages }}

  ### Current Message
  {{ current_message }}
  {% if context_type %}

  ## Search Scope Constraints

  **Restricted Context Type**: {{ context_type }}
  {% if target_abstract %}
  **Target Directory Abstract**: {{ target_abstract }}
  {% endif %}

  **Important**: You can only generate `{{ context_type }}` type queries, do not generate other types.
  {% endif %}

  ## Your Task

  Analyze the current task, identify context gaps, and generate queries to fill in the required information.

  **Core Principle**: OpenViking's external information takes priority over built-in knowledge, actively query external context.

  ## Context Types and Query Styles

  OpenViking supports the following context types, **each type has a different query style**:

  ### 1. skill (Execution Capability)

  **Purpose**: Executable tools, functions, APIs, automation scripts

  **When to Query**:
  - Task contains action verbs (create, generate, write, build, analyze, process)
  - Need to perform specific operations

  ### 2. resource (Knowledge Resources)

  **Purpose**: Documents, specifications, guides, code, configurations, and other structured knowledge

  **When to Query**:
  - Need reference materials, templates, specifications
  - Need to understand knowledge, concepts, definitions

  ### 3. memory (User/Agent Memory)

  **Purpose**: User personalization information or Agent execution experience

  **When to Query**:
  - Need personalized customization (user memory)
  - Need to learn from historical experience (agent memory)

  ## Analysis Method

  ### Step 1: Identify Task Type

  **Operational Tasks** (containing actions):
  - Characteristics: Verbs like create, generate, write, build, transform, calculate, analyze, process
  - Typical context combination: `skill + resource + memory`

  **Informational Tasks** (acquiring knowledge):
  - Characteristics: What is, how to understand, why, concept explanation, etc.
  - Typical context combination: `resource + memory`

  **Conversational Tasks** (small talk):
  - Characteristics: Greetings, small talk, confirmation of understanding, etc.
  - Usually no query needed

  ### Step 2: Check Context Coverage

  Analyze whether the session context (summary + recent conversation) already contains the information needed to complete the task:

  - **Fully covered**: Skip queries for that type
  - **Partially covered**: Generate supplementary queries
  - **Not covered**: Generate complete queries

  **Note**: Only skip information that has been **explicitly and in detail** discussed in the context.

  ### Step 3: Generate Queries

  **Important Principles**:

  1. **Don't over-transform**:
     - ❌ Don't convert "Create XX" to "XX format/specification"

  2. **Multi-type combination**:
     - A task may require multiple context types
     - Operational tasks typically need: skill (execution) + resource (reference) + memory (preference/experience)

  3. **Multiple queries per type**:
     - Can generate multiple queries for the same type
     - Maximum 5 queries

  4. **Queries should be concise and specific**:
     - Queries should be short, specific, and retrievable
     - Avoid lengthy descriptions

  5. **Priority setting**:
     - 1 = Highest priority (core requirement)
     - 3 = Medium priority (helpful)
     - 5 = Lowest priority (optional)

  6. **Query Style** (optimize for vector / semantic retrieval):
     - Queries are embedded and matched against indexed content by **semantic similarity**. Write each query so its embedding lands close to the target content — not necessarily a verbatim fragment, any phrasing that captures the same meaning works.
     - **Declarative, not interrogative**: state the information need as a noun/verb phrase rather than a question. Drop question framings ("what / who / when / how is ...").
     - **One information need per query**: each query targets one retrievable fact, relation, comparison, event, or procedure. Do not pile unrelated information needs into one query.
     - **Self-contained**: resolve pronouns and references using the session context; the retriever only sees the query string.
     - **Concept-dense and natural**: use a grammatical, well-formed phrase carrying the key entities, attributes, and qualifiers. Avoid both bare single keywords and telegraphic word-salad.
     - **No retrieval-meta words**: exclude words describing the act of retrieval or generic containers ("find", "search", "records", "information about", "content", "details", etc.) — they do not appear in the target content and only dilute the embedding.
     - **Keep discriminative specifics**: preserve names, dates, places, and domain terms from the task — they anchor the embedding to the right content.

  ## Output Format
  {
      "queries": [
          {
              "query": "Specific query text (following the style of the corresponding type)",
              "context_type": "skill|resource|memory",
              "priority": 1-5
          }
      ]
  }
  Please output JSON:

llm_config:
  temperature: 0.1
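
The Query Style rules in the template can be spot-checked on the model's output, for example when logging or evaluating. This is a rough heuristic sketch, not something the model or OpenViking requires: `style_issues` is a hypothetical helper, the word lists are copied from the examples in the template, and the "fewer than two words" check is an assumed stand-in for "bare single keyword".

```python
import re

# Question openers and retrieval-meta words named in the template's examples.
QUESTION_OPENERS = ("what", "who", "when", "how")
META_PHRASES = ("find", "search", "records", "information about", "content", "details")


def style_issues(query: str) -> list[str]:
    """Return a list of style problems found in one query (empty if none)."""
    issues = []
    lowered = query.strip().lower()
    words = re.findall(r"[\w']+", lowered)
    # Declarative, not interrogative.
    if lowered.endswith("?") or (words and words[0] in QUESTION_OPENERS):
        issues.append("interrogative")
    # No retrieval-meta words.
    for meta in META_PHRASES:
        if re.search(rf"\b{re.escape(meta)}\b", lowered):
            issues.append(f"meta word: {meta}")
    # Avoid bare single keywords (assumed threshold: fewer than two words).
    if len(words) < 2:
        issues.append("bare keyword")
    return issues


for candidate in ("RFC standard template", "What is the RFC template?", "find details about RFC"):
    print(candidate, "->", style_issues(candidate))
```

A check like this only flags surface patterns; it cannot tell whether a query is self-contained or concept-dense.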

Python Example

import json
import requests

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "guoxuter/ov_intent_analysis_sft:v7_q8"

payload = {
    "model": MODEL,
    "prompt": "<your rendered v7 prompt>",
    "stream": False,
    "think": False,
    "format": "json",
    "options": {
        "temperature": 0,
        "num_predict": 1024,
    },
}

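# Send the request and fail fast on HTTP errors.
response = requests.post(OLLAMA_URL, json=payload, timeout=120)
response.raise_for_status()

# With "stream": False, the generated text arrives as one JSON string in body["response"].
body = response.json()
result = json.loads(body["response"])
print(json.dumps(result, ensure_ascii=False, indent=2))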

Benchmarks

Benchmark environment: MacBook Pro, Apple M2 Pro, 12-core CPU (8 performance + 4 efficiency), 19-core GPU, 32 GB memory.

Model / Method	Locomo Accuracy	ChitChat F1	GPU Time (s)	CPU Time (s)	Quantization
doubao-seed-2.0-pro	0.9032	0.9176	-	-	None
qwen3.5-0.8b base	-	0.1556	7.78	12.74	8 bit
`v1_q8`	0.8955	0.9070	6.95	12.13	8 bit
`v4_q8`	0.8890	0.9176	2.86	5.57	8 bit
`v7_q8`	0.9037	0.9176	2.80	5.80	8 bit

The v7_q8 Locomo accuracy is the mean of 3 runs (0.9039 / 0.9045 / 0.9026; run-to-run spread of about ±0.1 pp), evaluated end-to-end inside OpenViking (intent → search retrieval → GPT-5.4 answer → LLM judge). ChitChat F1 is measured on the WOT chitchat-vs-task benchmark. GPU Time and CPU Time are mean per-request latencies in seconds.
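
The reported v7_q8 mean can be reproduced from the three run scores quoted above. The snippet below is only an arithmetic check; it uses no data beyond those three numbers.

```python
from statistics import mean

runs = [0.9039, 0.9045, 0.9026]  # the three v7_q8 Locomo runs quoted above

avg = mean(runs)
# Largest distance of any single run from the mean, in percentage points.
max_dev_pp = max(abs(r - avg) for r in runs) * 100

print(round(avg, 4))         # 0.9037
print(round(max_dev_pp, 2))  # 0.11
```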

Best Practices

Use v7_q8 for new integrations: best retrieval quality, with latency on par with v4_q8.
Match the prompt template to the tag: v7 SFT prompt for v7_q8, v4 prompt for v4_q8, original prompt for v1_q8.
Keep temperature at or near 0 for stable JSON output (the request examples use 0; the template's llm_config uses 0.1).
Set format to "json" to reduce parsing failures.
Set "think": false in production.
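Increase num_predict if the JSON output is cut off; for a long rendered prompt, also check that the context window (Ollama's num_ctx option) is large enough.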
Treat this as a retrieval-planning model, not a general-purpose chat model.

Typical Use Cases

Replace unnecessary find calls with intent-aware search planning.
Generate memory/resource/skill queries from incomplete user messages.
Reduce token cost by refusing retrieval for conversational turns.
Improve OpenViking memory injection quality with context-aware query expansion.