Readme

Qwen3-Embedding — Quantized by BatiAI

Text embedding models for semantic search + RAG. Three size tiers, one namespace. Quantized directly from Alibaba's BF16 safetensors (not re-quantized from another GGUF) and signed with general.author: BatiAI metadata. Part of BatiAI's Mac-first RAG stack.

Tags

Pick by RAM

| Tag    | Quant          | Size    | Target Mac               |
|--------|----------------|---------|---------------------------|
| 0.6b   | Q8_0 (default) | 610 MB  | every Mac (8 GB+)         |
| 0.6b-q6 | Q6_K          | 472 MB  | smaller footprint         |
| 4b     | Q8_0 (default) | 4.28 GB | 16 GB+ Mac (sweet spot)   |
| 4b-q6  | Q6_K           | 3.31 GB | tighter disk              |
| 8b     | Q8_0 (default) | 8.05 GB | 24 GB+ Mac (top quality)  |
| 8b-q6  | Q6_K           | 6.21 GB | 24 GB Mac friendly        |
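The table above reduces to a simple picker. A hypothetical helper (not part of the published stack) that maps unified memory to the recommended tag, mirroring the "Target Mac" column:

```python
def pick_tag(ram_gb: int, prefer_q6: bool = False) -> str:
    """Map a Mac's unified memory (GB) to the recommended model tag.

    Thresholds follow the tags table: 24 GB+ -> 8b, 16 GB+ -> 4b,
    everything else -> 0.6b. prefer_q6 trades quality for disk space.
    """
    if ram_gb >= 24:
        base = "8b"
    elif ram_gb >= 16:
        base = "4b"
    else:
        base = "0.6b"
    return f"{base}-q6" if prefer_q6 else base
```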

Quick pulls

# Default tier — lightweight, runs on every Mac
ollama pull batiai/qwen3-embedding:0.6b

# Balanced mid tier — 16 GB+ Macs
ollama pull batiai/qwen3-embedding:4b

# Top tier — MTEB #1 open embedder at release
ollama pull batiai/qwen3-embedding:8b

Usage (Ollama embeddings API)

curl http://localhost:11434/api/embeddings -d '{
  "model": "batiai/qwen3-embedding:4b",
  "prompt": "semantic search query"
}'
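The same call from Python, using only the standard library. `build_request` and `embed` are illustrative names (not an official client), and `embed` assumes a running Ollama server on the default port:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # same endpoint as the curl call

def build_request(model: str, prompt: str) -> urllib.request.Request:
    # Builds the same JSON body the curl example sends.
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

def embed(model: str, prompt: str) -> list[float]:
    # Requires a running Ollama server; returns the "embedding" field.
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.load(resp)["embedding"]
```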

Qwen3-Embedding recommended prompt format — best retrieval quality:

# Query side (use instruction prefix)
prompt = "Instruct: Given a document query, retrieve the most relevant chunk.\n" \
         "Query: " + user_input

# Document side (raw text, no prefix)
prompt = document_chunk
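The two conventions above can be wrapped in small helpers (hypothetical names) so call sites cannot accidentally mix the query and document formats:

```python
# Instruction text copied from the recommended prompt format above.
INSTRUCTION = "Instruct: Given a document query, retrieve the most relevant chunk.\n"

def format_query(user_input: str) -> str:
    # Query side: instruction prefix improves retrieval quality.
    return f"{INSTRUCTION}Query: {user_input}"

def format_document(chunk: str) -> str:
    # Document side: raw text, no prefix.
    return chunk
```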

Matryoshka — output dimensionality scales with model size (1024 / 2560 / 4096 dims for 0.6B / 4B / 8B). Truncate at read time for faster search (default recommendation: 1024 dims). No re-embedding needed.
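A minimal sketch of read-time truncation, assuming the truncated vector is re-normalized so cosine similarities stay comparable:

```python
import math

def truncate_embedding(vec: list[float], dims: int = 1024) -> list[float]:
    """Matryoshka truncation: keep the first `dims` components,
    then re-normalize to unit length for cosine search."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```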

Quality — measured (published)

Four-stage verification harness, EN + KO balanced, 120 items total.

| Model     | Same-lang directional | Cross-lingual separation Δ | Real-doc top-1 (EN / KO) | Q8 ↔ Q6 drift avg |
|-----------|-----------------------|----------------------------|---------------------------|--------------------|
| 0.6B (Q6) | 30/30 (100 %)         | 0.521                      | 95 % / 100 %              | 0.9967             |
| 4B (Q6)   | 30/30 (100 %)         | 0.540                      | 95 % / 100 %              | 0.9984             |
| 8B (Q6)   | 30/30 (100 %)         | 0.569                      | 100 % / 100 %             | 0.9988             |

Quality improves cleanly and monotonically with size. 8B achieves perfect top-1 retrieval on the real-document test set for both English and Korean, so the full value of its MTEB #1 ranking shows up on realistic business-doc workloads.

Full testset + harness: scripts/bench-embedding-quality.sh in the publisher repo.
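As a rough sketch of what the drift column measures, assuming the drift average is the mean cosine similarity between Q8 and Q6 embeddings of the same inputs (1.0 = no drift; the published harness is authoritative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def drift_avg(q8_vecs: list[list[float]], q6_vecs: list[list[float]]) -> float:
    # Average cosine similarity between the two quants' embeddings
    # of the same texts, paired index-by-index.
    return sum(cosine(a, b) for a, b in zip(q8_vecs, q6_vecs)) / len(q8_vecs)
```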

Why BatiAI?

  • Direct from Qwen BF16 — not re-quantized from another GGUF
  • general.author: BatiAI metadata for provenance
  • 4-stage quality published — real measured numbers, not marketing
  • Q8_0 + Q6_K only — IQ quants excluded because low-bit cosine drift cascades through vector similarity (embeddings are different from chat LLMs)
  • Matched pairs with Qwen3-Reranker and Qwen3.6-35B-A3B in the same namespace

BatiAI RAG Stack

user query
   ↓ batiai/qwen3-embedding:{0.6b|4b|8b}     ← text embedder
1024-dim vector
   ↓ vector DB (sqlite-vec / LanceDB)
top-K candidates
   ↓ Qwen3-Reranker (via llama-server)
top-3
   ↓ batiai/qwen3.6-35b:iq4                  ← chat LLM
answer
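The vector-DB stage of the pipeline above reduces to a cosine top-K over stored chunk vectors. A minimal in-memory sketch (illustrative only, not the actual sqlite-vec / LanceDB query):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 10) -> list[int]:
    # Rank stored chunk vectors by cosine similarity to the query
    # and return the indices of the k best candidates.
    scored = sorted(enumerate(doc_vecs), key=lambda iv: cosine(query_vec, iv[1]), reverse=True)
    return [i for i, _ in scored[:k]]
```

In the full stack, these candidates would then go to Qwen3-Reranker before the chat LLM sees them.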

All components in the batiai/ Ollama namespace (or HF org). Direct-from-source, consistent provenance.

Image / screenshot search?

Use the VL variant: batiai/qwen3-vl-embed-2b / :8b. The text-only embedders here are lighter and more accurate for plain-text workloads.

HF mirrors (full spectrum + scripts)

Full BatiAI RAG Stack collection →

License

Apache 2.0 — commercial use permitted. BatiAI’s quantization pipeline is MIT.