83d83f8f51e5 · 8.8GB
MIXTRAL MOE / I-MATRIX / 22B (2X13B) / I-QUANT
A contender for the “go-to” storytelling/roleplay spot among the models I have come across in my search for the best one (a fool’s errand). Although all 22 billion parameters are loaded into memory, only around 13 billion are accessed for any given token. This not only drastically increases generation speed but also reduces the total size on disk: because the experts share most of their layers, the model is effectively a 26-billion-parameter (2x13B) model while only about 22 billion parameters are actually stored. In cases where the context needs to be constantly rewritten in its entirety, or where generation speed for standard dense models is low (multiple GPUs without a high-speed interconnect, for instance), an MoE (Mixture of Experts) model may prove beneficial. On top of that, its Mixtral base outputs distinctive prose in the vein of 22-24B models, but at higher speed. To stuff as many parameters into as little VRAM as possible, weighted K- and I-quants are listed. Whenever model size allows, quantizations are picked to fit within 8, 10, 12, and 16GB GPUs.
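To make the “only part of the model runs per token” idea concrete, here is a toy NumPy sketch of top-k expert routing. This is an illustration only, not Mixtral’s or this model’s actual code: the expert count, top-k value, and layer sizes are made-up assumptions.

```python
# Toy sketch of MoE top-k routing. Illustrative assumptions only:
# 2 experts, top-1 routing, tiny dimensions -- not the model's real config.
import numpy as np

rng = np.random.default_rng(0)

d_model   = 16   # toy hidden size
n_experts = 2    # "2x13B": two experts
top_k     = 1    # experts consulted per token (assumed)

# Each expert is its own feed-forward block; the router is a small linear layer.
experts = [
    (rng.standard_normal((d_model, 4 * d_model)),   # up-projection
     rng.standard_normal((4 * d_model, d_model)))   # down-projection
    for _ in range(n_experts)
]
router_w = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    """Route one token's hidden state through only the top-k experts."""
    logits = x @ router_w                                   # router scores, shape (n_experts,)
    chosen = np.argsort(logits)[-top_k:]                    # indices of the selected experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                                    # softmax over the selected experts only
    out = np.zeros_like(x)
    for gate, idx in zip(gates, chosen):
        w_up, w_down = experts[idx]
        out += gate * (np.maximum(x @ w_up, 0.0) @ w_down)  # ReLU FFN stand-in for the expert
    return out, chosen

token = rng.standard_normal(d_model)
_, used = moe_layer(token)
print(f"experts consulted for this token: {used.tolist()} out of {n_experts}")
```

With two experts and top-1 routing in this toy setup, each token only touches one expert’s feed-forward weights, which is the mechanism behind the speed gain described above.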
Note that I-quants forfeit some token-generation speed relative to K-quants in exchange for storage efficiency; that efficiency is what lets the 5-bit quantization fit inside the VRAM of a 16GB GPU. These quantizations were sourced from GGUF files on Hugging Face.
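For a rough sense of which quantizations land under the 8, 10, 12, and 16GB marks, here is a back-of-the-envelope Python sketch. The bits-per-weight figures are assumed approximate averages for common llama.cpp quant types, and the estimate ignores KV cache, context, and runtime overhead, so real GGUF files will differ somewhat.

```python
# Back-of-the-envelope weight-size estimate: parameters x bits-per-weight / 8.
# The bpw values below are assumed approximate averages; KV cache and
# runtime overhead are not included.
PARAMS = 22e9  # ~22 billion parameters loaded into memory

approx_bpw = {     # assumed average bits per weight (rough figures)
    "IQ3_XXS": 3.1,
    "IQ4_XS":  4.3,
    "Q4_K_M":  4.8,
    "Q5_K_M":  5.7,
    "Q6_K":    6.6,
}

for name, bpw in approx_bpw.items():
    gib = PARAMS * bpw / 8 / 1024**3
    print(f"{name:8} ~{gib:4.1f} GiB of weights")
```

At these assumed rates, the ~3-bit I-quant lands near the 8.8GB file listed above, and the 5-bit quant sits a little under 16GB, which lines up roughly with the VRAM tiers described in this card.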
GGUF weighted quantizations (mradermacher):
[No obligatory model picture. Ollama would not like it.]