ollama run MedAIBase/Qwen3-VL-Embedding:2b-q8_0
Qwen3-VL-Embedding-2B
Highlights
The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities.
While the Embedding model generates high-dimensional vectors for broad applications like retrieval and clustering, the Reranker model is engineered to refine these results, establishing a comprehensive pipeline for state-of-the-art multimodal search.
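As a concrete illustration of that two-stage pipeline, the sketch below ranks a synthetic corpus by cosine similarity over normalized vectors and then re-orders the shortlisted candidates with a placeholder scoring function. The random vectors and the rerank_score stand-in are illustrative assumptions only; they are not real model outputs or the Qwen3-VL-Reranker API.
# Minimal sketch of a retrieve-then-rerank pipeline over pre-computed,
# L2-normalized embeddings. All data here is random and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query_emb = normalize(rng.normal(size=(1, 2048)))   # stand-in query embedding
doc_embs = normalize(rng.normal(size=(100, 2048)))  # stand-in corpus embeddings

# Stage 1: dense retrieval. Rank the corpus by cosine similarity and keep top-k.
scores = (query_emb @ doc_embs.T).ravel()
top_k = np.argsort(-scores)[:10]

# Stage 2: rerank the shortlisted candidates. `rerank_score` is a placeholder
# for a joint (query, document) scorer such as Qwen3-VL-Reranker.
def rerank_score(doc_idx):
    return scores[doc_idx]  # placeholder: reuse the retrieval score

reranked = sorted(top_k.tolist(), key=rerank_score, reverse=True)
print("Top documents after reranking:", reranked)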
Model Overview
Qwen3-VL-Embedding-2B has the following features:
- Model Type: multimodal (text, image, and video) embedding
- Number of Parameters: 2B
- Number of Layers: 28
- Sequence Length: 32K tokens
- Embedding Dimension: 2048, with MRL support for custom output dimensions
- Instruction Aware: supports task-specific instructions
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog and GitHub repository.
Qwen3-VL-Embedding and Qwen3-VL-Reranker Model list
| Model | Size | Model Layers | Sequence Length | Embedding Dimension | Quantization Support | MRL Support | Instruction Aware |
|---|---|---|---|---|---|---|---|
| Qwen3-VL-Embedding-2B | 2B | 28 | 32K | 2048 | Yes | Yes | Yes |
| Qwen3-VL-Embedding-8B | 8B | 36 | 32K | 4096 | Yes | Yes | Yes |
| Qwen3-VL-Reranker-2B | 2B | 28 | 32K | - | - | - | Yes |
| Qwen3-VL-Reranker-8B | 8B | 36 | 32K | - | - | - | Yes |
Note:
- MRL Support indicates whether the embedding model supports custom output dimensions for the final embedding.
- Instruction Aware indicates whether the model supports customizing the input instruction for different tasks.
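Because MRL is supported, the 2048-dimensional output can be truncated to a smaller dimension and re-normalized before computing similarities. The snippet below is a generic illustration with random vectors standing in for real embeddings.
# Generic MRL-style truncation: keep the leading dimensions of the full
# embedding, then re-normalize before computing cosine similarity.
# The random vectors here are placeholders, not real model outputs.
import numpy as np

rng = np.random.default_rng(0)
full_emb = rng.normal(size=(2, 2048))  # pretend these are two 2048-d embeddings

def truncate_and_normalize(emb, dim):
    emb = emb[..., :dim]
    return emb / np.linalg.norm(emb, axis=-1, keepdims=True)

for dim in (2048, 1024, 256):
    a, b = truncate_and_normalize(full_emb, dim)
    print(dim, float(a @ b))  # cosine similarity at each truncated dimension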
Model Performance
Evaluation Results on [MMEB-V2](https://huggingface.co/spaces/TIGER-Lab/MMEB-Leaderboard)
Results on the MMEB-V2 benchmark. All models except IFM-TTE have been re-evaluated on the updated VisDoc OOD split. CLS: classification, QA: question answering, RET: retrieval, GD: grounding, MRET: moment retrieval, VDR: ViDoRe, VR: VisRAG, OOD: out-of-distribution.
| Model | Model Size | Image CLS | Image QA | Image RET | Image GD | Image Overall | Video CLS | Video QA | Video RET | Video MRET | Video Overall | VisDoc VDRv1 | VisDoc VDRv2 | VisDoc VR | VisDoc OOD | VisDoc Overall | All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # of Datasets → | | 10 | 10 | 12 | 4 | 36 | 5 | 5 | 5 | 3 | 18 | 10 | 4 | 6 | 4 | 24 | 78 |
| VLM2Vec | 2B | 58.7 | 49.3 | 65.0 | 72.9 | 59.7 | 33.4 | 30.5 | 20.6 | 30.7 | 28.6 | 49.8 | 13.5 | 51.8 | 48.2 | 44.0 | 47.7 |
| VLM2Vec-V2 | 2B | 62.9 | 56.3 | 69.5 | 77.3 | 64.9 | 39.3 | 34.3 | 28.8 | 36.8 | 34.6 | 75.5 | 44.9 | 79.4 | 62.2 | 69.2 | 59.2 |
| GME-2B | 2B | 54.4 | 29.9 | 66.9 | 55.5 | 51.9 | 34.9 | 42.0 | 25.6 | 31.1 | 33.6 | 86.1 | 54.0 | 82.5 | 67.5 | 76.8 | 55.3 |
| GME-7B | 7B | 57.7 | 34.7 | 71.2 | 59.3 | 56.0 | 37.4 | 50.4 | 28.4 | 37.0 | 38.4 | 89.4 | 55.6 | 85.0 | 68.3 | 79.3 | 59.1 |
| Ops-MM-embedding-v1 | 8B | 69.7 | 69.6 | 73.1 | 87.2 | 72.7 | 59.7 | 62.2 | 45.7 | 43.2 | 53.8 | 80.1 | 59.6 | 79.3 | 67.8 | 74.4 | 68.9 |
| IFM-TTE | 8B | 76.7 | 78.5 | 74.6 | 89.3 | 77.9 | 60.5 | 67.9 | 51.7 | 54.9 | 59.2 | 85.2 | 71.5 | 92.7 | 53.3 | 79.5 | 74.1 |
| RzenEmbed | 8B | 70.6 | 71.7 | 78.5 | 92.1 | 75.9 | 58.8 | 63.5 | 51.0 | 45.5 | 55.7 | 89.7 | 60.7 | 88.7 | 69.9 | 81.3 | 72.9 |
| Seed-1.6-embedding-1215 | unknown | 75.0 | 74.9 | 79.3 | 89.0 | 78.0 | 85.2 | 66.7 | 59.1 | 54.8 | 67.7 | 90.0 | 60.3 | 90.0 | 70.7 | 82.2 | 76.9 |
| Qwen3-VL-Embedding-2B | 2B | 70.3 | 74.3 | 74.8 | 88.5 | 75.0 | 71.9 | 64.9 | 53.9 | 53.3 | 61.9 | 84.4 | 65.3 | 86.4 | 69.4 | 79.2 | 73.2 |
| Qwen3-VL-Embedding-8B | 8B | 74.2 | 81.1 | 80.2 | 92.3 | 80.1 | 78.4 | 71.0 | 58.7 | 56.1 | 67.1 | 87.2 | 69.9 | 88.7 | 73.3 | 82.4 | 77.8 |
Evaluation Results on [MMTEB](https://huggingface.co/spaces/mteb/leaderboard)
Results on the MMTEB benchmark.
| Model | Size | Mean (Task) | Mean (Type) | Bitext Mining | Classification | Clustering | Instruction Retrieval | Multilabel Classification | Pair Classification | Reranking | Retrieval | STS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NV-Embed-v2 | 7B | 56.29 | 49.58 | 57.84 | 57.29 | 40.80 | 1.04 | 18.63 | 78.94 | 63.82 | 56.72 | 71.10 |
| GritLM-7B | 7B | 60.92 | 53.74 | 70.53 | 61.83 | 49.75 | 3.45 | 22.77 | 79.94 | 63.78 | 58.31 | 73.33 |
| BGE-M3 | 0.6B | 59.56 | 52.18 | 79.11 | 60.35 | 40.88 | -3.11 | 20.1 | 80.76 | 62.79 | 54.60 | 74.12 |
| multilingual-e5-large-instruct | 0.6B | 63.22 | 55.08 | 80.13 | 64.94 | 50.75 | -0.40 | 22.91 | 80.86 | 62.61 | 57.12 | 76.81 |
| gte-Qwen2-1.5B-instruct | 1.5B | 59.45 | 52.69 | 62.51 | 58.32 | 52.05 | 0.74 | 24.02 | 81.58 | 62.58 | 60.78 | 71.61 |
| gte-Qwen2-7b-Instruct | 7B | 62.51 | 55.93 | 73.92 | 61.55 | 52.77 | 4.94 | 25.48 | 85.13 | 65.55 | 60.08 | 73.98 |
| text-embedding-3-large | - | 58.93 | 51.41 | 62.17 | 60.27 | 46.89 | -2.68 | 22.03 | 79.17 | 63.89 | 59.27 | 71.68 |
| Cohere-embed-multilingual-v3.0 | - | 61.12 | 53.23 | 70.50 | 62.95 | 46.89 | -1.89 | 22.74 | 79.88 | 64.07 | 59.16 | 74.80 |
| Gemini Embedding | - | 68.37 | 59.59 | 79.28 | 71.82 | 54.59 | 5.18 | 29.16 | 83.63 | 65.58 | 67.71 | 79.40 |
| Qwen3-Embedding-0.6B | 0.6B | 64.33 | 56.00 | 72.22 | 66.83 | 52.33 | 5.09 | 24.59 | 80.83 | 61.41 | 64.64 | 76.17 |
| Qwen3-Embedding-4B | 4B | 69.45 | 60.86 | 79.36 | 72.33 | 57.15 | 11.56 | 26.77 | 85.05 | 65.08 | 69.60 | 80.86 |
| Qwen3-Embedding-8B | 8B | 70.58 | 61.69 | 80.89 | 74.00 | 57.65 | 10.06 | 28.66 | 86.40 | 65.63 | 70.88 | 81.08 |
| Qwen3-VL-Embedding-2B | 2B | 63.87 | 55.84 | 69.51 | 65.86 | 52.50 | 3.87 | 26.08 | 78.50 | 64.80 | 67.12 | 74.29 |
| Qwen3-VL-Embedding-8B | 8B | 67.88 | 58.88 | 77.48 | 71.95 | 55.82 | 4.46 | 28.59 | 81.08 | 65.72 | 69.41 | 75.41 |
Usage
Requirements:
transformers>=4.57.0
qwen-vl-utils>=0.0.14
torch==2.8.0
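Since this page distributes an Ollama build of the model (see the ollama run command at the top), embeddings can also be requested through a locally running Ollama server. The sketch below is an assumption-laden example using the ollama Python client and the quantized tag shown above; it is not part of the official Qwen usage examples.
# Hedged sketch: text-only embeddings via a local Ollama server.
# Assumes `pip install ollama`, a running Ollama daemon, and that the model
# tag matches the `ollama run` command at the top of this page.
import ollama

resp = ollama.embed(
    model="MedAIBase/Qwen3-VL-Embedding:2b-q8_0",
    input=[
        "A woman playing with her dog on a beach at sunset.",
        "City skyline view from a high-rise building at night.",
    ],
)
print(len(resp["embeddings"]), len(resp["embeddings"][0]))  # vector count, dimension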
Basic Usage Example
from scripts.qwen3_vl_embedding import Qwen3VLEmbedder
import numpy as np
import torch
# Define a list of query texts
queries = [
    {"text": "A woman playing with her dog on a beach at sunset."},
    {"text": "Pet owner training dog outdoors near water."},
    {"text": "Woman surfing on waves during a sunny day."},
    {"text": "City skyline view from a high-rise building at night."}
]
# Define a list of document texts and images
documents = [
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust."},
    {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}
]
# Specify the model path
model_name_or_path = "Qwen/Qwen3-VL-Embedding-2B"
# Initialize the Qwen3VLEmbedder model
model = Qwen3VLEmbedder(model_name_or_path=model_name_or_path)
# We recommend enabling flash_attention_2 for better acceleration and memory savings:
# model = Qwen3VLEmbedder(model_name_or_path=model_name_or_path, torch_dtype=torch.float16, attn_implementation="flash_attention_2")
# Combine queries and documents into a single input list
inputs = queries + documents
# Process the inputs to get embeddings
embeddings = model.process(inputs)
# Compute similarity scores between query embeddings and document embeddings
similarity_scores = (embeddings[:4] @ embeddings[4:].T)
# Print out the similarity scores in a list format
print(similarity_scores.tolist())
# [[0.8157786130905151, 0.7178360223770142, 0.7173429131507874], [0.5195091962814331, 0.3302568793296814, 0.4391537308692932], [0.3884059488773346, 0.285782128572464, 0.33141762018203735], [0.1092604324221611, 0.03871120512485504, 0.06952016055583954]]
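If you need an explicit ranking rather than raw scores, one possible follow-up (assuming the scores come back as a torch tensor, as the .tolist() call above suggests) is:
# Order the three documents for each query by descending similarity.
# Assumes `similarity_scores` is a torch tensor of shape (num_queries, num_docs).
ranking = similarity_scores.argsort(dim=-1, descending=True)
print(ranking.tolist())  # per-query document indices, best match first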
For more usage examples, please visit our GitHub repository.
vLLM Basic Usage Example
import argparse
import numpy as np
import os
from typing import List, Dict, Any
from vllm import LLM, EngineArgs
from vllm.multimodal.utils import fetch_image
# Define a list of query texts
queries = [
    {"text": "A woman playing with her dog on a beach at sunset."},
    {"text": "Pet owner training dog outdoors near water."},
    {"text": "Woman surfing on waves during a sunny day."},
    {"text": "City skyline view from a high-rise building at night."}
]
# Define a list of document texts and images
documents = [
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust."},
    {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}
]
# Build a chat-style conversation from a {"text": ..., "image": ...} input dict
def format_input_to_conversation(input_dict: Dict[str, Any], instruction: str = "Represent the user's input.") -> List[Dict]:
    content = []
    text = input_dict.get('text')
    image = input_dict.get('image')
    if image:
        image_content = None
        if isinstance(image, str):
            if image.startswith(('http', 'https', 'oss')):
                image_content = image
            else:
                abs_image_path = os.path.abspath(image)
                image_content = 'file://' + abs_image_path
        else:
            image_content = image
        if image_content:
            content.append({
                'type': 'image',
                'image': image_content,
            })
    if text:
        content.append({'type': 'text', 'text': text})
    if not content:
        content.append({'type': 'text', 'text': ""})
    conversation = [
        {"role": "system", "content": [{"type": "text", "text": instruction}]},
        {"role": "user", "content": content}
    ]
    return conversation
# Convert an input dict into vLLM's prompt + multi_modal_data format
def prepare_vllm_inputs(input_dict: Dict[str, Any], llm, instruction: str = "Represent the user's input.") -> Dict[str, Any]:
    image = input_dict.get('image')
    conversation = format_input_to_conversation(input_dict, instruction)
    prompt_text = llm.llm_engine.tokenizer.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=True
    )
    multi_modal_data = None
    if image:
        if isinstance(image, str):
            if image.startswith(('http', 'https', 'oss')):
                try:
                    image_obj = fetch_image(image)
                    multi_modal_data = {"image": image_obj}
                except Exception as e:
                    print(f"Warning: Failed to fetch image {image}: {e}")
            else:
                abs_image_path = os.path.abspath(image)
                if os.path.exists(abs_image_path):
                    from PIL import Image
                    image_obj = Image.open(abs_image_path)
                    multi_modal_data = {"image": image_obj}
                else:
                    print(f"Warning: Image file not found: {abs_image_path}")
        else:
            multi_modal_data = {"image": image}
    result = {
        "prompt": prompt_text,
        "multi_modal_data": multi_modal_data
    }
    return result
def main():
    parser = argparse.ArgumentParser(description="Offline Similarity Check with vLLM")
    parser.add_argument("--model-path", type=str, default="Qwen/Qwen3-VL-Embedding-2B", help="Path to the model")
    parser.add_argument("--dtype", type=str, default="bfloat16", help="Data type (e.g., bfloat16)")
    args = parser.parse_args()
    print(f"Loading model from {args.model_path}...")
    engine_args = EngineArgs(
        model=args.model_path,
        runner="pooling",
        dtype=args.dtype,
        trust_remote_code=True,
    )
    llm = LLM(**vars(engine_args))
    all_inputs = queries + documents
    vllm_inputs = [prepare_vllm_inputs(inp, llm) for inp in all_inputs]
    outputs = llm.embed(vllm_inputs)
    embeddings_list = []
    for i, output in enumerate(outputs):
        emb = output.outputs.embedding
        embeddings_list.append(emb)
        print(f"Input {i} embedding dimension: {len(emb)}")
    embeddings = np.array(embeddings_list)
    print(f"\nEmbeddings shape: {embeddings.shape}")
    num_queries = len(queries)
    query_embeddings = embeddings[:num_queries]
    doc_embeddings = embeddings[num_queries:]
    similarity_scores = query_embeddings @ doc_embeddings.T
    print("\nSimilarity Scores:")
    print(similarity_scores.tolist())

if __name__ == "__main__":
    main()
SGLang Basic Usage Example
import argparse
import numpy as np
import torch
import os
from typing import List, Dict, Any
from sglang.srt.entrypoints.engine import Engine
# Define a list of query texts
queries = [
    {"text": "A woman playing with her dog on a beach at sunset."},
    {"text": "Pet owner training dog outdoors near water."},
    {"text": "Woman surfing on waves during a sunny day."},
    {"text": "City skyline view from a high-rise building at night."}
]
# Define a list of document texts and images
documents = [
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust."},
    {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}
]
# Build a chat-style conversation from a {"text": ..., "image": ...} input dict
def format_input_to_conversation(input_dict: Dict[str, Any], instruction: str = "Represent the user's input.") -> List[Dict]:
    content = []
    text = input_dict.get('text')
    image = input_dict.get('image')
    if image:
        image_content = None
        if isinstance(image, str):
            if image.startswith(('http', 'oss')):
                image_content = image
            else:
                abs_image_path = os.path.abspath(image)
                image_content = 'file://' + abs_image_path
        else:
            image_content = image
        if image_content:
            content.append({
                'type': 'image', 'image': image_content,
            })
    if text:
        content.append({'type': 'text', 'text': text})
    if not content:
        content.append({'type': 'text', 'text': ""})
    conversation = [
        {"role": "system", "content": [{"type": "text", "text": instruction}]},
        {"role": "user", "content": content}
    ]
    return conversation
# Convert an input dict into SGLang's text + image format
def convert_to_sglang_format(input_dict: Dict[str, Any], engine: Engine, instruction: str = "Represent the user's input.") -> Dict[str, Any]:
    conversation = format_input_to_conversation(input_dict, instruction)
    text_for_api = engine.tokenizer_manager.tokenizer.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=True
    )
    result = {"text": text_for_api}
    image = input_dict.get('image')
    if image and isinstance(image, str):
        result["image"] = image
    return result
def main():
    parser = argparse.ArgumentParser(description="Offline Similarity Check with SGLang")
    parser.add_argument("--model-path", type=str, default="Qwen/Qwen3-VL-Embedding-2B", help="Path to the model")
    parser.add_argument("--dtype", type=str, default="bfloat16", help="Data type (e.g., bfloat16)")
    args = parser.parse_args()
    print(f"Loading model from {args.model_path}...")
    engine = Engine(
        model_path=args.model_path,
        is_embedding=True,
        dtype=args.dtype,
        trust_remote_code=True,
    )
    inputs = queries + documents
    sglang_inputs = [convert_to_sglang_format(inp, engine) for inp in inputs]
    print(f"Processing {len(sglang_inputs)} inputs...")
    prompts = [inp['text'] for inp in sglang_inputs]
    images = [inp.get('image') for inp in sglang_inputs]
    results = engine.encode(prompts, image_data=images)
    embeddings_list = [res['embedding'] for res in results]
    embeddings = np.array(embeddings_list)
    print(f"Embeddings shape: {embeddings.shape}")
    num_queries = len(queries)
    query_embeddings = embeddings[:num_queries]
    doc_embeddings = embeddings[num_queries:]
    similarity_scores = query_embeddings @ doc_embeddings.T
    print("\nSimilarity Scores:")
    print(similarity_scores.tolist())

if __name__ == "__main__":
    main()
Citation
If you find our work helpful, feel free to cite us.
@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv},
  year={2026}
}