Notes
Uploaded in Q4_0 and Q8_0 GGUF quantizations.
- Q4_0 – the lowest‑bit version that still retains most of the original quality; good for CPU‑only inference on modest RAM (≈1 GB).
- Q8_0 – higher‑bit quantization with near‑lossless fidelity; use it when you can spare the extra memory and want output closest to the unquantized model.
Temperature: The model was trained for fairly deterministic behavior. Start with 0.1 – 0.2 for reliable answers; increase to ≈0.6 only if you want more creative or exploratory output.
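As a concrete starting point, below is a minimal CPU‑only loading sketch using llama-cpp-python, assuming your llama.cpp build supports the Falcon‑H1 hybrid architecture. The GGUF filename, context size, and thread count are assumptions; point `model_path` at the file you actually downloaded. The temperature follows the recommendation above.

```python
# Minimal sketch with llama-cpp-python (pip install llama-cpp-python).
# The GGUF filename below is hypothetical; use the file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="Falcon-H1-1.5B-Deep-Instruct-Q4_0.gguf",  # hypothetical filename
    n_ctx=4096,       # context window; raise it if your RAM allows
    n_threads=8,      # set to your physical CPU core count
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain what a state-space model is in two sentences."}
    ],
    temperature=0.2,  # low temperature, per the note above
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```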
Description
Falcon‑H1‑1.5B‑Deep‑Instruct – a 1.5B‑parameter hybrid model that combines a decoder‑only Transformer stack with Mamba‑2 (state‑space) components.
Key architectural highlights:
- Hybrid Transformer + Mamba: Attention and Mamba‑2 (state‑space) components work side by side in each block, giving strong sequence modeling while keeping compute and memory costs low.
- Efficient inference: The mixed architecture enables fast token generation on CPUs and NPUs, making the model well‑suited for edge devices.
- Multilingual: Primarily English but trained on a multilingual corpus, so it handles many languages reasonably well.
- Instruction‑tuned: Optimized for chat, tool‑calling, and structured JSON output.
Ideal for:
- On‑device assistants and chatbots
- Retrieval‑augmented generation (RAG) pipelines
- Structured data extraction / JSON generation (see the sketch after this list)
- Lightweight code completion (FIM)
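For the structured‑output use case, here is a short sketch of JSON extraction that reuses the `llm` handle from the loading example above; the schema and the input sentence are illustrative only.

```python
# Sketch of structured JSON extraction; reuses `llm` from the loading example above.
# The schema and input text are purely illustrative.
import json

messages = [
    {"role": "system", "content": "Reply with a single JSON object and nothing else."},
    {"role": "user", "content": (
        'Extract {"name": string, "year": integer} from: '
        '"The Falcon-H1 models were released by TII in 2025."'
    )},
]

out = llm.create_chat_completion(messages=messages, temperature=0.1, max_tokens=128)
reply = out["choices"][0]["message"]["content"]

try:
    record = json.loads(reply)    # small models occasionally emit invalid JSON
except json.JSONDecodeError:
    record = None                 # retry, or fall back to grammar-constrained decoding
print(record)
```

In practice you would typically validate the parsed object against a schema or use llama.cpp's grammar‑constrained sampling (GBNF) to guarantee well‑formed JSON.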
References
- Falcon‑H1 release blog post
- Technical report (arXiv:2507.22448)
- Model card on HuggingFace
- Discord community