aquif-3 is a lightweight, high-efficiency, and powerful mixture-of-experts (MoE) model. Built on a new Mamba-2 hybrid-recurrent architecture, it shows strong reasoning capabilities and activates only ~1B parameters per forward pass while still delivering competitive results across multiple benchmarks.
aquif-3-preview
The aquif MoE delivers strong performance despite its small active-parameter count:
| Benchmark | aquif-3.0-preview-2 (2.5B active) | aquif-3-preview (1B active) |
|---|---|---|
| MMLU | 55.9 | 60.4 |
| HumanEval | 80.5 | 82.4 |
| GSM8K | 72.5 | 70.1 |
| Average | 69.6 | 71.0 |
These results reflect internal evaluations on representative test sets. Final scores may vary slightly in public benchmarks.
To enhance reasoning, activate “thinking mode” with the following control message before your prompt:
{
"role": "control",
"content": "thinking"
}
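As a minimal sketch (assuming tokenizer and device are set up as in the quickstart further down), the control message is simply placed before the user turn in the conversation:
# Sketch: prepend the "thinking" control message before the user prompt.
# Assumes tokenizer and device are initialized as in the quickstart below.
conv = [
    {"role": "control", "content": "thinking"},
    {"role": "user", "content": "If a train travels 60 km in 45 minutes, what is its average speed?"},
]
input_ids = tokenizer.apply_chat_template(
    conv,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True
).to(device)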
Alternatively, you can set thinking=True when calling apply_chat_template in your Hugging Face code:
input_ids = tokenizer.apply_chat_template(
    conv,
    return_tensors="pt",
    thinking=True,
    return_dict=True,
    add_generation_prompt=True
).to(device)
This enables internal self-reflection logic and improves multi-step task accuracy.
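For completeness, here is a minimal sketch of running generation once thinking is enabled, reusing the setup from the quickstart below; the exact formatting of the emitted reasoning trace is defined by the model's chat template and is not documented here:
# Sketch: generate and decode with thinking mode enabled
# (model, tokenizer, device and input_ids as built above).
output = model.generate(**input_ids, max_new_tokens=8192)
prediction = tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:], skip_special_tokens=True)
print(prediction)  # should contain the reasoning trace and the final answer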
To run the model via Hugging Face, you need to install the granitemoe_hybrid_external_cleanup branch of IBM's transformers fork instead of regular HF transformers, as aquif-3-preview is a finetune of Granite-4.0-Tiny-Base:
git clone https://github.com/Ssukriti/transformers.git
cd transformers
git checkout granitemoe_hybrid_external_cleanup
pip install -e .
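To sanity-check that the editable install of the fork (rather than a previously installed transformers release) is the one being picked up, you can print the package location; the branch's exact version string is not documented here, so this only verifies the import path:
# Quick check that the editable install of the fork is on the import path
import transformers
print(transformers.__version__)
print(transformers.__file__)  # should point inside the cloned transformers/ directory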
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch
model_path = "aquiffoo/aquif-3-preview"
device = "cuda"
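# Load the model in bfloat16 and place it on the selected device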
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_path
)
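# Build a single-turn conversation and apply the chat template (thinking mode disabled here)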
conv = [{"role": "user", "content": "Hi!"}]
input_ids = tokenizer.apply_chat_template(conv, return_tensors="pt", thinking=False, return_dict=True, add_generation_prompt=True).to(device)
set_seed(42)
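# Generate a response (up to 8192 new tokens)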
output = model.generate(
    **input_ids,
    max_new_tokens=8192,
)
prediction = tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:], skip_special_tokens=True)
print(prediction)
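The quickstart above pins the random seed with set_seed(42); if you want more varied outputs, the standard transformers sampling arguments can be passed to generate (these are generic generation parameters, not values recommended by the aquif team):
# Optional: enable sampling for more varied outputs
# (generic transformers arguments; not aquif-specific tuning)
output = model.generate(
    **input_ids,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)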
The future of aquif AI includes both dense and Mixture-of-Experts models, which are smarter and more efficient at inference. We can't wait to see what you create with aquif-3.