83d83f8f51e5 · 8.8GB
MIXTRAL MOE / I-MATRIX / 22B (2X13B) / I-QUANT
A contender for the “go-to” storytelling/roleplay spot among the models I have come across in my search for the best one (a fool’s errand). Although all 22 billion parameters are loaded into memory, only around 13 billion are accessed for any given token. This not only drastically increases generation speed but also reduces the total size on disk: because the experts share most of their layers, the model is effectively a 26-billion-parameter (2x13B) model while only about 22 billion parameters are actually stored. In cases where the context needs to be constantly rewritten in its entirety, or where generation speed for standard dense models is low (multiple GPUs without a high-speed interconnect, for instance), an MoE (Mixture of Experts) model may prove beneficial. On top of that, its Mixtral base outputs distinctive prose in the vein of 22-24B models, but at higher speed. To stuff as many parameters into as little VRAM as possible, weighted K- and I-quants are listed. Whenever model size allows, quantizations are picked to fit within 8, 10, 12, and 16GB GPUs.
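To make the “only part of the model runs per token” idea concrete, here is a toy NumPy sketch of top-k expert routing. This is an illustration only, not Mixtral’s or this model’s actual code: the expert count, top-k value, and layer sizes are made-up assumptions.

```python
# Toy sketch of MoE top-k routing. Illustrative assumptions only:
# 2 experts, top-1 routing, tiny dimensions -- not the model's real config.
import numpy as np

rng = np.random.default_rng(0)

d_model   = 16   # toy hidden size
n_experts = 2    # "2x13B": two experts
top_k     = 1    # experts consulted per token (assumed)

# Each expert is its own feed-forward block; the router is a small linear layer.
experts = [
    (rng.standard_normal((d_model, 4 * d_model)),   # up-projection
     rng.standard_normal((4 * d_model, d_model)))   # down-projection
    for _ in range(n_experts)
]
router_w = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    """Route one token's hidden state through only the top-k experts."""
    logits = x @ router_w                                   # router scores, shape (n_experts,)
    chosen = np.argsort(logits)[-top_k:]                    # indices of the selected experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                                    # softmax over the selected experts only
    out = np.zeros_like(x)
    for gate, idx in zip(gates, chosen):
        w_up, w_down = experts[idx]
        out += gate * (np.maximum(x @ w_up, 0.0) @ w_down)  # ReLU FFN stand-in for the expert
    return out, chosen

token = rng.standard_normal(d_model)
_, used = moe_layer(token)
print(f"experts consulted for this token: {used.tolist()} out of {n_experts}")
```

With two experts and top-1 routing in this toy setup, each token only touches one expert’s feed-forward weights, which is the mechanism behind the speed gain described above.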
Note that I-quants forfeit some token-generation speed relative to K-quants in exchange for storage efficiency; that efficiency is what lets the 5-bit quantization fit inside the VRAM of a 16GB GPU. These quantizations were sourced from GGUF files on Hugging Face.
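For a rough sense of which quantizations land under the 8, 10, 12, and 16GB marks, here is a back-of-the-envelope Python sketch. The bits-per-weight figures are assumed approximate averages for common llama.cpp quant types, and the estimate ignores KV cache, context, and runtime overhead, so real GGUF files will differ somewhat.

```python
# Back-of-the-envelope weight-size estimate: parameters x bits-per-weight / 8.
# The bpw values below are assumed approximate averages; KV cache and
# runtime overhead are not included.
PARAMS = 22e9  # ~22 billion parameters loaded into memory

approx_bpw = {     # assumed average bits per weight (rough figures)
    "IQ3_XXS": 3.1,
    "IQ4_XS":  4.3,
    "Q4_K_M":  4.8,
    "Q5_K_M":  5.7,
    "Q6_K":    6.6,
}

for name, bpw in approx_bpw.items():
    gib = PARAMS * bpw / 8 / 1024**3
    print(f"{name:8} ~{gib:4.1f} GiB of weights")
```

At these assumed rates, the ~3-bit I-quant lands near the 8.8GB file listed above, and the 5-bit quant sits a little under 16GB, which lines up roughly with the VRAM tiers described in this card.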
GGUF weighted quantizations (mradermacher):
[No obligatory model picture. Ollama would not like it.]