A lightweight local intent-analysis model built for the OpenViking search API. It detects whether a conversation turn needs context retrieval, skips chitchat to reduce unnecessary memory injection and token usage, and emits structured retrieval queries.

tools thinking
ollama run guoxuter/ov_intent_analysis_sft:v4_q8

Details

3 weeks ago

bab5186d07d8 · 812MB · qwen35 · 752M · Q8_0
{{ if .System }}<|im_start|>system {{ .System }}<|im_end|> {{ end }}<|im_start|>user {{ .Prompt }}<|
{ "stop": [ "<|im_end|>", "<|im_start|>", "<|endoftext|>" ] }

Readme

ov_intent_analysis_sft

Local intent-analysis model for OpenViking retrieval planning.

ov_intent_analysis_sft is a lightweight Q8-quantized Ollama model designed for local deployment with OpenViking and the OpenViking OpenClaw Plugin.

Its main purpose is to decide whether a user turn actually needs context retrieval. For small talk, greetings, or turns where the required context is already covered, the model returns an empty query list, helping avoid unnecessary memory injection and reduce token usage. When retrieval is needed, it emits compact JSON queries for OpenViking context types such as skill, resource, and memory.

Highlights

  • Retrieval refusal: avoids unnecessary retrieval for chitchat or fully-covered context.
  • Context-aware query rewriting: completes under-specified user queries using recent conversation context.
  • Compact JSON output: emits structured retrieval queries for skill, resource, and memory.
  • Local deployment: runs through Ollama in local CPU environments with low integration cost.
  • Optimized v4 prompt: v4_q8 provides lower latency with a smaller output schema.

Available Tags

| Tag | Recommended | Description |
|---|---|---|
| v4_q8 | Yes | Compact output, lower latency, requires the v4 prompt template. |
| v1_q8 | Compatible | Works with the original OpenViking intent-analysis prompt. |

Quick Start

Install Ollama:

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

ollama --version

Pull the model:

# Recommended
ollama pull guoxuter/ov_intent_analysis_sft:v4_q8

# Compatible legacy version
ollama pull guoxuter/ov_intent_analysis_sft:v1_q8

Call with the Ollama API:

curl http://127.0.0.1:11434/api/generate -d '{
  "model": "guoxuter/ov_intent_analysis_sft:v4_q8",
  "prompt": "<your rendered v4 prompt>",
  "stream": false,
  "think": false,
  "format": "json",
  "options": {
    "temperature": 0,
    "num_predict": 1024
  }
}'

Production note: the model was not trained with thinking mode. Set "think": false to avoid extra latency.

Output Format

v4_q8 returns a single JSON object:

{
  "queries": [
    {
      "query": "RFC standard template",
      "context_type": "resource",
      "priority": 1
    }
  ]
}

If no retrieval is needed:

{
  "queries": []
}
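
Before acting on a reply, a caller may want to check it against the schema above. The following is a minimal sketch rather than anything shipped with OpenViking: `parse_plan` and `ALLOWED_TYPES` are hypothetical names, and the 1 to 5 priority range is taken from the v4 prompt template.

```python
import json

# Context types listed in this README; the constant name is illustrative.
ALLOWED_TYPES = {"skill", "resource", "memory"}


def parse_plan(raw: str) -> list[dict]:
    """Parse the model's raw JSON reply and return the validated query list.

    Raises ValueError if the reply does not match the documented schema.
    An empty list means the model decided no retrieval is needed.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict) or not isinstance(data.get("queries"), list):
        raise ValueError("reply must be an object with a 'queries' array")

    for item in data["queries"]:
        if not isinstance(item, dict):
            raise ValueError("each query must be an object")
        query = item.get("query")
        if not isinstance(query, str) or not query.strip():
            raise ValueError("'query' must be a non-empty string")
        ctype = item.get("context_type")
        if not isinstance(ctype, str) or ctype not in ALLOWED_TYPES:
            raise ValueError(f"unknown context_type: {ctype!r}")
        priority = item.get("priority")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(priority, bool) or not isinstance(priority, int) or not 1 <= priority <= 5:
            raise ValueError("'priority' must be an integer from 1 to 5")

    return data["queries"]


queries = parse_plan('{"queries": []}')
if not queries:
    print("no retrieval needed")
```

Raising on a malformed reply keeps the decision with the caller, who can then choose between retrying and treating the turn as needing no retrieval.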

Context Types

| Type | Meaning | Query Style |
|---|---|---|
| skill | Executable capability, tool, function, API, automation | Imperative verb phrase, e.g. Create RFC document |
| resource | Knowledge artifact, document, spec, guide, code, config | Noun phrase, e.g. RFC standard template |
| memory | User preference or agent execution experience | User's ..., Experience executing ..., or System insights about ... |
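
If each context type is served by a different backend, it can help to group the emitted queries by type before dispatching them. This is a sketch under assumptions: `group_by_type` is a hypothetical helper, the sample queries are illustrative, and the ordering assumes priority 1 is the most important, which the README does not state.

```python
from collections import defaultdict


def group_by_type(queries: list[dict]) -> dict[str, list[str]]:
    """Group query texts by context_type, ordered by ascending priority.

    Assumption: priority 1 ranks highest. Flip the sort if your
    deployment treats 5 as the most important.
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for item in sorted(queries, key=lambda q: q["priority"]):
        grouped[item["context_type"]].append(item["query"])
    return dict(grouped)


# Illustrative queries following the query styles in the table above.
example = [
    {"query": "User's preferred RFC format", "context_type": "memory", "priority": 2},
    {"query": "Create RFC document", "context_type": "skill", "priority": 1},
    {"query": "RFC standard template", "context_type": "resource", "priority": 1},
]
print(group_by_type(example))
```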

Recommended v4 Prompt Template

metadata:
  id: "retrieval.intent_analysis"
  name: "Intent Analysis v4"
  description: "v4 prompt for compact intent-analysis models that emit only queries."
  version: "4.0.0"
  language: "en"
  category: "retrieval"

template: |
  You are OpenViking's context query planner. Given the session context and the current message, decide what context information is missing and emit retrieval queries to fill the gap.

  ## Session Context

  ### Session Summary
  {{ compression_summary }}

  ### Recent Conversation
  {{ recent_messages }}

  ### Current Message
  {{ current_message }}
  {% if context_type %}

  ## Search Scope Constraints

  **Restricted Context Type**: {{ context_type }}
  {% if target_abstract %}
  **Target Directory Abstract**: {{ target_abstract }}
  {% endif %}

  Only emit `{{ context_type }}` queries; do not generate other types.
  {% endif %}

  External information takes priority over built-in knowledge - actively query for any missing context.

  ## Procedure

  1. Classify the task - operational tasks typically need skill+resource+memory; informational tasks typically need resource+memory; conversational small talk needs no query.
  2. Skip any context type already covered explicitly in the conversation.
  3. For each needed type, emit 1-5 concise retrievable queries with `priority` from 1 to 5.

  ## Output Format

  Output a single JSON object with exactly one top-level key:

  - `queries`: array of objects with:
    - `query`: actual query text
    - `context_type`: one of `skill`, `resource`, `memory`
    - `priority`: integer from 1 to 5

  If no query is needed, set `queries` to an empty array `[]`.

  Output the JSON object directly. Do not wrap it in markdown code fences.

llm_config:
  temperature: 0.0
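
The `{{ ... }}` and `{% if %}` markers in the template follow Jinja-style syntax. Below is a minimal rendering sketch, assuming the third-party `jinja2` package is installed. It uses a shortened excerpt of the template for brevity. `render_prompt` is a hypothetical helper, the sample message formatting is an assumption, and this is not necessarily how OpenViking itself loads or renders the YAML.

```python
from jinja2 import Template

# Shortened excerpt of the v4 template; use the full template text in practice.
TEMPLATE_EXCERPT = """\
### Session Summary
{{ compression_summary }}

### Recent Conversation
{{ recent_messages }}

### Current Message
{{ current_message }}
{% if context_type %}
## Search Scope Constraints

**Restricted Context Type**: {{ context_type }}
{% if target_abstract %}
**Target Directory Abstract**: {{ target_abstract }}
{% endif %}
Only emit `{{ context_type }}` queries; do not generate other types.
{% endif %}
"""


def render_prompt(summary, recent, current, context_type=None, target_abstract=None):
    """Fill the template variables; the two optional ones may stay None."""
    return Template(TEMPLATE_EXCERPT).render(
        compression_summary=summary,
        recent_messages=recent,
        current_message=current,
        context_type=context_type,
        target_abstract=target_abstract,
    )


# Illustrative inputs; the "role: text" line format is an assumption.
prompt = render_prompt(
    summary="(none)",
    recent="user: We are drafting an RFC.",
    current="Which template should I use?",
    context_type="resource",
)
print(prompt)
```

Leaving `context_type` unset drops the whole Search Scope Constraints block, so the model is free to emit queries of any type.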

Python Example

import json
import requests

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "guoxuter/ov_intent_analysis_sft:v4_q8"

payload = {
    "model": MODEL,
    "prompt": "<your rendered v4 prompt>",
    "stream": False,
    "think": False,
    "format": "json",
    "options": {
        "temperature": 0,
        "num_predict": 1024,
    },
}

response = requests.post(OLLAMA_URL, json=payload, timeout=120)
response.raise_for_status()

body = response.json()
result = json.loads(body["response"])
print(json.dumps(result, ensure_ascii=False, indent=2))

Benchmarks

Benchmark environment: MacBook Pro, Apple M2 Pro, 12-core CPU (8 performance + 4 efficiency), 19-core GPU, 32 GB memory.

| Model / Method | Locomo Accuracy | ChitChat F1 | GPU Time | CPU Time | Quantization |
|---|---|---|---|---|---|
| doubao-seed-2.0-pro | 0.9032 | 0.9176 | - | - | None |
| qwen3.5-0.8b base | - | 0.1556 | 7.78 | 12.74 | 8 bit |
| v1_q8 | 0.8955 | 0.9070 | 6.95 | 12.13 | 8 bit |
| v4_q8 | 0.8890 | 0.9176 | 2.86 | 5.57 | 8 bit |
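
The latency gap between the two tags is easier to read as a ratio. The snippet below only does arithmetic on the v1_q8 and v4_q8 rows of the benchmark table. The time unit is not stated in the source, but it cancels out in the ratio.

```python
# Values copied from the v1_q8 and v4_q8 benchmark rows.
v1 = {"gpu_time": 6.95, "cpu_time": 12.13, "locomo_acc": 0.8955}
v4 = {"gpu_time": 2.86, "cpu_time": 5.57, "locomo_acc": 0.8890}

gpu_speedup = v1["gpu_time"] / v4["gpu_time"]
cpu_speedup = v1["cpu_time"] / v4["cpu_time"]
acc_delta = v4["locomo_acc"] - v1["locomo_acc"]

print(f"GPU speedup v1 -> v4: {gpu_speedup:.2f}x")   # 2.43x
print(f"CPU speedup v1 -> v4: {cpu_speedup:.2f}x")   # 2.18x
print(f"Locomo accuracy change: {acc_delta:+.4f}")   # -0.0065
```

On this machine, v4_q8 is roughly 2.4 times faster than v1_q8 on GPU and 2.2 times faster on CPU, at a Locomo accuracy cost of 0.0065.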

Best Practices

  • Use v4_q8 for new integrations.
  • Use the v4 prompt template with v4_q8; use the original prompt only for v1_q8.
  • Set temperature to 0 for deterministic JSON output.
  • Set format to "json" to reduce parsing failures.
  • Set "think": false in production.
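  • num_predict caps the number of generated tokens, so increase it if the JSON output is truncated. If the rendered prompt is long, raise num_ctx so the prompt fits in the context window.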
  • Treat this as a retrieval-planning model, not a general-purpose chat model.

Typical Use Cases

  • Replace unnecessary find calls with intent-aware search planning.
  • Generate memory/resource/skill queries from incomplete user messages.
  • Reduce token cost by refusing retrieval for conversational turns.
  • Improve OpenViking memory injection quality with context-aware query expansion.
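
The token saving in these use cases comes from skipping the search step entirely when the query list is empty. Here is a minimal sketch of that gate, assuming a caller-supplied `search` function. `retrieve_context`, `search`, and the fake backend are hypothetical stand-ins, not OpenViking APIs.

```python
from typing import Callable


def retrieve_context(plan: dict, search: Callable[[str, str], list[str]]) -> list[str]:
    """Run retrieval only when the intent model asked for it.

    `plan` is the parsed JSON object returned by the model.
    `search(query, context_type)` is a hypothetical retrieval backend.
    """
    queries = plan.get("queries", [])
    if not queries:
        # Chitchat or fully covered context: no search, nothing injected.
        return []
    snippets: list[str] = []
    for item in queries:
        snippets.extend(search(item["query"], item["context_type"]))
    return snippets


calls = []  # records every backend call so the skip is observable


def fake_search(query: str, context_type: str) -> list[str]:
    """Stand-in backend that returns one labelled snippet per query."""
    calls.append((query, context_type))
    return [f"[{context_type}] result for: {query}"]


print(retrieve_context({"queries": []}, fake_search))
print(retrieve_context(
    {"queries": [{"query": "RFC standard template", "context_type": "resource", "priority": 1}]},
    fake_search,
))
```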