71 Downloads · Updated 4 months ago
8a4705e9e938 · 2.0GB
aquif-moe-800m is our first Mixture of Experts (MoE) model, with only 800 million active parameters. Despite its compact size, it delivers exceptional performance-per-VRAM efficiency compared to larger models.
aquif-moe-800m delivers strong performance across multiple benchmarks, especially when considering its parameter efficiency:
| Benchmark | aquif-moe (0.8B) | Llama 3.2 (1B) | Gemma 3 (4B) |
|---|---|---|---|
| MMLU | 52.2 | 49.3 | 59.6 |
| HumanEval | 37.5 | 22.6 | 36.0 |
| GSM8K | 49.0 | 44.4 | 38.4 |
| Average | 46.2 | 38.8 | 44.7 |
One of aquif-moe-800m’s standout features is its exceptional VRAM efficiency:
| Model | Average Performance | VRAM (GB) | Performance per VRAM (avg score / GB) |
|---|---|---|---|
| aquif-moe | 46.2 | 0.8 | 57.8 |
| Llama 3.2 | 38.8 | 1.2 | 32.3 |
| Gemma 3 | 44.7 | 4.3 | 10.4 |
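
For reference, the Performance per VRAM figures above are simply the average benchmark score divided by the VRAM footprint in GB. A minimal Python sketch that reproduces the column from the quoted numbers:

```python
# Reproduces the "Performance per VRAM" column: average benchmark score
# divided by VRAM footprint in GB, using the figures quoted in the tables.
models = {
    "aquif-moe": {"scores": [52.2, 37.5, 49.0], "vram_gb": 0.8},
    "Llama 3.2": {"scores": [49.3, 22.6, 44.4], "vram_gb": 1.2},
    "Gemma 3":   {"scores": [59.6, 36.0, 38.4], "vram_gb": 4.3},
}

for name, m in models.items():
    avg = sum(m["scores"]) / len(m["scores"])
    per_vram = avg / m["vram_gb"]
    print(f"{name:10s} avg={avg:.1f}  perf/VRAM={per_vram:.1f}")
```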
To run via Ollama:

```
ollama run aquiffoo/aquif-moe-800m
```
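
For programmatic access, the model can also be queried through Ollama's local REST API. The sketch below assumes an Ollama server running on the default port (11434) and that the model has already been pulled:

```python
# Minimal sketch: querying the model via Ollama's local REST API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "aquiffoo/aquif-moe-800m",
        "prompt": "Summarize what a Mixture of Experts model is in one sentence.",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```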
aquif-moe-800m leverages a Mixture of Experts architecture to achieve high parameter efficiency. While the total parameter count is larger, only 800 million parameters are activated for any given inference step, allowing for significantly reduced VRAM requirements while maintaining competitive performance.
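
As a rough illustration of the routing idea (not this model's actual implementation; the expert count, top-k, and dimensions below are arbitrary placeholders), a top-k MoE layer only touches the selected experts' weights for each token:

```python
# Illustrative sketch of top-k MoE routing: only a fraction of the layer's
# parameters is used per token. All sizes here are placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

router_w = rng.standard_normal((d_model, n_experts))            # gating network
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    """Route a single token vector to its top-k experts and mix their outputs."""
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]                           # indices of chosen experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()   # softmax over chosen experts
    # Only the selected experts' matrices are multiplied; the rest stay idle.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape, f"active experts: {top_k}/{n_experts}")
```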
The model’s exceptional VRAM efficiency makes it particularly valuable for enterprise deployments:

- Concurrent Sessions: Run multiple model instances on a single GPU (see the sketch after this list)
- High Throughput: Serve more users with the same hardware footprint
- Cost Efficiency: Lower infrastructure costs for production deployments
- Scalability: Easier horizontal scaling across available resources
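
As a sketch of the concurrent-sessions point, several prompts can be sent in parallel to one local Ollama server; how many are processed simultaneously depends on the server's parallelism settings:

```python
# Sketch: serving several prompts concurrently against one local Ollama
# instance. The prompts and worker count are placeholders.
from concurrent.futures import ThreadPoolExecutor
import requests

MODEL = "aquiffoo/aquif-moe-800m"
URL = "http://localhost:11434/api/generate"

def ask(prompt):
    r = requests.post(URL, json={"model": MODEL, "prompt": prompt, "stream": False}, timeout=300)
    r.raise_for_status()
    return r.json()["response"]

prompts = [
    "Give one use case for small MoE models.",
    "Explain VRAM in one sentence.",
    "What is GSM8K?",
]

with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80], "...")
```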
The 128K context window enables comprehensive document analysis while maintaining the model’s efficient resource utilization, making it suitable for enterprises dealing with large documents or extended conversations.
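
To make use of the long context from the API, Ollama's `num_ctx` option sets the context length per request. The sketch below assumes a local server, a placeholder `report.txt`, and enough memory to allocate the full window:

```python
# Sketch: passing a long document and requesting a larger context window.
import requests

long_document = open("report.txt", encoding="utf-8").read()  # placeholder document

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "aquiffoo/aquif-moe-800m",
        "prompt": f"Summarize the key findings:\n\n{long_document}",
        "stream": False,
        "options": {"num_ctx": 131072},  # up to the advertised 128K tokens
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```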
*Note: All performance metrics are approximate estimates based on internal evaluations.*