Readme

Qwen3-Embedding — Quantized by BatiAI

Text embedding models for semantic search + RAG. Three size tiers, one namespace. Quantized directly from Alibaba's BF16 safetensors (not re-quantized from another GGUF) and signed with general.author: BatiAI metadata. Part of BatiAI's Mac-first RAG stack.

Tags

Pick by RAM

| Tag    | Quant          | Size    | Target Mac               |
|--------|----------------|---------|---------------------------|
| 0.6b   | Q8_0 (default) | 610 MB  | every Mac (8 GB+)         |
| 0.6b-q6 | Q6_K          | 472 MB  | smaller footprint         |
| 4b     | Q8_0 (default) | 4.28 GB | 16 GB+ Mac (sweet spot)   |
| 4b-q6  | Q6_K           | 3.31 GB | tighter disk              |
| 8b     | Q8_0 (default) | 8.05 GB | 24 GB+ Mac (top quality)  |
| 8b-q6  | Q6_K           | 6.21 GB | 24 GB Mac friendly        |
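The table above reduces to a simple picker. A hypothetical helper (not part of the published stack) that maps unified memory to the recommended tag, mirroring the "Target Mac" column:

```python
def pick_tag(ram_gb: int, prefer_q6: bool = False) -> str:
    """Map a Mac's unified memory (GB) to the recommended model tag.

    Thresholds follow the tags table: 24 GB+ -> 8b, 16 GB+ -> 4b,
    everything else -> 0.6b. prefer_q6 trades quality for disk space.
    """
    if ram_gb >= 24:
        base = "8b"
    elif ram_gb >= 16:
        base = "4b"
    else:
        base = "0.6b"
    return f"{base}-q6" if prefer_q6 else base
```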

Quick pulls

# Default tier — lightweight, runs on every Mac
ollama pull batiai/qwen3-embedding:0.6b

# Balanced mid tier — 16 GB+ Macs
ollama pull batiai/qwen3-embedding:4b

# Top tier — MTEB #1 open embedder at release
ollama pull batiai/qwen3-embedding:8b

Usage (Ollama embeddings API)

curl http://localhost:11434/api/embeddings -d '{
  "model": "batiai/qwen3-embedding:4b",
  "prompt": "semantic search query"
}'
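The same call from Python, using only the standard library. `build_request` and `embed` are illustrative names (not an official client), and `embed` assumes a running Ollama server on the default port:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # same endpoint as the curl call

def build_request(model: str, prompt: str) -> urllib.request.Request:
    # Builds the same JSON body the curl example sends.
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

def embed(model: str, prompt: str) -> list[float]:
    # Requires a running Ollama server; returns the "embedding" field.
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.load(resp)["embedding"]
```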

Qwen3-Embedding recommended prompt format — best retrieval quality:

# Query side (use instruction prefix)
prompt = "Instruct: Given a document query, retrieve the most relevant chunk.\n" \
         "Query: " + user_input

# Document side (raw text, no prefix)
prompt = document_chunk
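The two conventions above can be wrapped in small helpers (hypothetical names) so call sites cannot accidentally mix the query and document formats:

```python
# Instruction text copied from the recommended prompt format above.
INSTRUCTION = "Instruct: Given a document query, retrieve the most relevant chunk.\n"

def format_query(user_input: str) -> str:
    # Query side: instruction prefix improves retrieval quality.
    return f"{INSTRUCTION}Query: {user_input}"

def format_document(chunk: str) -> str:
    # Document side: raw text, no prefix.
    return chunk
```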

Matryoshka — output dimensionality scales with model size (1024 / 2560 / 4096 dims for 0.6B / 4B / 8B). Truncate at read time for faster search (default recommendation: 1024 dims). No re-embedding needed.
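A minimal sketch of read-time truncation, assuming the truncated vector is re-normalized so cosine similarities stay comparable:

```python
import math

def truncate_embedding(vec: list[float], dims: int = 1024) -> list[float]:
    """Matryoshka truncation: keep the first `dims` components,
    then re-normalize to unit length for cosine search."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```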

Quality — measured (published)

Four-stage verification harness, EN + KO balanced, 120 items total.

| Model     | Same-lang directional | Cross-lingual separation Δ | Real-doc top-1 (EN / KO) | Q8 ↔ Q6 drift avg |
|-----------|-----------------------|----------------------------|---------------------------|--------------------|
| 0.6B (Q6) | 30/30 (100 %)         | 0.521                      | 95 % / 100 %              | 0.9967             |
| 4B (Q6)   | 30/30 (100 %)         | 0.540                      | 95 % / 100 %              | 0.9984             |
| 8B (Q6)   | 30/30 (100 %)         | 0.569                      | 100 % / 100 %             | 0.9988             |

Quality improves cleanly and monotonically with size. 8B achieves perfect top-1 retrieval on the real-document test set for both English and Korean, so the full value of its MTEB #1 ranking shows up on realistic business-doc workloads.

Full testset + harness: scripts/bench-embedding-quality.sh in the publisher repo.
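As a rough sketch of what the drift column measures, assuming the drift average is the mean cosine similarity between Q8 and Q6 embeddings of the same inputs (1.0 = no drift; the published harness is authoritative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def drift_avg(q8_vecs: list[list[float]], q6_vecs: list[list[float]]) -> float:
    # Average cosine similarity between the two quants' embeddings
    # of the same texts, paired index-by-index.
    return sum(cosine(a, b) for a, b in zip(q8_vecs, q6_vecs)) / len(q8_vecs)
```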

Why BatiAI?

  • Direct from Qwen BF16 — not re-quantized from another GGUF
  • general.author: BatiAI metadata for provenance
  • 4-stage quality published — real measured numbers, not marketing
  • Q8_0 + Q6_K only — IQ quants excluded because low-bit cosine drift cascades through vector similarity (embeddings are different from chat LLMs)
  • Matched pairs with Qwen3-Reranker and Qwen3.6-35B-A3B in the same namespace

BatiAI RAG Stack

user query
   ↓ batiai/qwen3-embedding:{0.6b|4b|8b}     ← text embedder
1024-dim vector
   ↓ vector DB (sqlite-vec / LanceDB)
top-K candidates
   ↓ Qwen3-Reranker (via llama-server)
top-3
   ↓ batiai/qwen3.6-35b:iq4                  ← chat LLM
answer
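The vector-DB stage of the pipeline above reduces to a cosine top-K over stored chunk vectors. A minimal in-memory sketch (illustrative only, not the actual sqlite-vec / LanceDB query):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 10) -> list[int]:
    # Rank stored chunk vectors by cosine similarity to the query
    # and return the indices of the k best candidates.
    scored = sorted(enumerate(doc_vecs), key=lambda iv: cosine(query_vec, iv[1]), reverse=True)
    return [i for i, _ in scored[:k]]
```

In the full stack, these candidates would then go to Qwen3-Reranker before the chat LLM sees them.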

All components in the batiai/ Ollama namespace (or HF org). Direct-from-source, consistent provenance.

Image / screenshot search?

Use the VL variant: batiai/qwen3-vl-embed-2b / :8b. The text-only embedders here are lighter and more accurate for plain-text workloads.

HF mirrors (full spectrum + scripts)

Full BatiAI RAG Stack collection →

License

Apache 2.0 — commercial use permitted. BatiAI’s quantization pipeline is MIT.