1.5b
ollama run sam860/falcon-h1:1.5b

Notes

Uploaded in Q4_0 and Q8_0 quantization formats.

  • Q4_0 – the lowest‑bit version that still retains most of the original quality; good for CPU‑only inference on modest RAM (≈1 GB).
  • Q8_0 – higher‑bit, closer to the original weights; use it when you have more memory to spare or want the best output quality of the two (see the tag example below).
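
If the upload exposes a separate tag per quantization, you can pick one explicitly when pulling or running. The tag names below are an assumption for illustration only; check the model's tag list for the exact names:

```
# Hypothetical per-quantization tags; check the model's tag list for the real names.
ollama run sam860/falcon-h1:1.5b-q4_0   # smallest, CPU-friendly
ollama run sam860/falcon-h1:1.5b-q8_0   # higher fidelity, needs more memory
```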

Temperature: the instruction tuning favors fairly deterministic behavior. Start with 0.1–0.2 for reliable answers; increase to ≈0.6 only if you want more creative or exploratory output.
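
When calling the model programmatically, temperature goes in the request's options field. A minimal sketch against Ollama's local HTTP API (assumes a server running on the default localhost:11434; the prompt is just an illustration):

```python
import requests

# Chat request with a low sampling temperature, per the note above.
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "sam860/falcon-h1:1.5b",
        "messages": [
            {"role": "user", "content": "Explain what a state-space model is in two sentences."}
        ],
        "options": {"temperature": 0.2},  # 0.1-0.2 for reliable answers; ~0.6 for creative output
        "stream": False,                  # return one JSON object instead of a token stream
    },
    timeout=120,
)
print(response.json()["message"]["content"])
```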


Description

Falcon‑H1‑1.5B‑Deep‑Instruct – a 1.5 B‑parameter hybrid model that combines a classic decoder‑only transformer stack with Mamba (state‑space) blocks.

Key architectural highlights:

  • Hybrid Transformer + Mamba: attention layers are combined with Mamba‑2 state‑space (SSM) layers, giving strong sequence modeling while keeping compute low.
  • Efficient inference: The mixed architecture enables fast token generation on CPUs and NPUs, making the model well‑suited for edge devices.
  • Multilingual: Primarily English but trained on a multilingual corpus, so it handles many languages reasonably well.
  • Instruction‑tuned: Optimized for chat, tool‑calling, and structured JSON output (see the JSON sketch after this list).
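
For the structured‑output case, Ollama can constrain generation to valid JSON via the format field. A minimal sketch under the same assumptions as above (local server on localhost:11434; the extraction task and keys are only an example):

```python
import json
import requests

# Ask for a small JSON object; "format": "json" tells Ollama to constrain the
# output to valid JSON, and a low temperature keeps the structure stable.
payload = {
    "model": "sam860/falcon-h1:1.5b",
    "messages": [
        {
            "role": "user",
            "content": (
                "Extract the person and city from this sentence as JSON with the keys "
                "'name' and 'city': Ada flew to Lisbon on Tuesday."
            ),
        }
    ],
    "format": "json",
    "options": {"temperature": 0.1},
    "stream": False,
}
reply = requests.post("http://localhost:11434/api/chat", json=payload, timeout=120)
extracted = json.loads(reply.json()["message"]["content"])
print(extracted)  # e.g. {"name": "Ada", "city": "Lisbon"}
```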

Ideal for:

  • On‑device assistants and chatbots
  • Retrieval‑augmented generation (RAG) pipelines
  • Structured data extraction / JSON generation
  • Lightweight code completion (fill‑in‑the‑middle, FIM)

References

  • Falcon‑H1 release blog post
  • Technical report (arXiv:2507.22448)
  • Model card on Hugging Face
  • Discord community