LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding. Updated to version 1.6.
vision
7b
13b
34b
2.3M Pulls Updated 11 months ago
f02dd72bb242 · 59B
{
"stop": [
"<|im_start|>",
"<|im_end|>"
]
}
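The params blob above lists two stop sequences. As a rough illustration of what stop sequences usually do, the sketch below parses that JSON and cuts generated text at the earliest occurrence of either marker. The helper `truncate_at_stop` and the sample string `raw` are hypothetical names made up for this example; this is not the runtime's actual implementation, only a sketch of the general idea.

```python
import json

# The params blob from this page, embedded as a string for the example.
PARAMS_JSON = '{"stop": ["<|im_start|>", "<|im_end|>"]}'


def truncate_at_stop(text, stop_sequences):
    """Hypothetical helper: return text up to the first stop sequence found."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            # Keep the earliest cut point across all stop sequences.
            cut = min(cut, idx)
    return text[:cut]


params = json.loads(PARAMS_JSON)

# Made-up raw model output that runs past the end-of-turn marker.
raw = "A cat sitting on a sofa.<|im_end|><|im_start|>user"
print(truncate_at_stop(raw, params["stop"]))
```

Text with no stop marker passes through unchanged, so the helper is safe to apply to every response.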