sam860/ falcon-h1:1.5b-deep-Q8_0

ollama run sam860/falcon-h1:1.5b-deep-Q8_0

Details

3 months ago · caa0703bff10 · 1.7GB · falcon-h1 · 1.55B · Q8_0

Readme

Notes

Uploaded in Q4_0 and Q8_0 formats.

  • Q4_0 – the lower‑bit of the two uploads; it retains most of the original quality and suits CPU‑only inference on modest RAM (≈1 GB).
  • Q8_0 – higher‑bit with slightly better fidelity; use it when you have more memory to spare or want the best possible output.
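
As a rough way to choose between the two at runtime, the sketch below picks a tag based on free memory. The Q4_0 tag name is an assumption (only the Q8_0 tag appears on this page), and the 3 GB cutoff is an illustrative guess, not a measured requirement:

    import psutil  # pip install psutil

    # Hypothetical tag for the Q4_0 upload; only the Q8_0 tag is confirmed above.
    Q4_TAG = "sam860/falcon-h1:1.5b-deep-Q4_0"
    Q8_TAG = "sam860/falcon-h1:1.5b-deep-Q8_0"

    # Q8_0 weighs about 1.7 GB on disk, so leave some headroom.
    free_gb = psutil.virtual_memory().available / 2**30
    model = Q8_TAG if free_gb > 3 else Q4_TAG
    print(f"{free_gb:.1f} GB free -> using {model}")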

Temperature: The model was trained for fairly deterministic behavior. Start with 0.1 – 0.2 for reliable answers; increase to ≈0.6 only if you want more creative or exploratory output.
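
With the official ollama Python client (pip install ollama), the temperature is passed through the options field; a minimal sketch following the guidance above:

    import ollama

    # 0.2 keeps answers deterministic and reliable; raise toward 0.6
    # only for more creative output.
    response = ollama.chat(
        model="sam860/falcon-h1:1.5b-deep-Q8_0",
        messages=[{"role": "user", "content": "Explain state-space models in two sentences."}],
        options={"temperature": 0.2},
    )
    print(response["message"]["content"])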


Description

Falcon‑H1‑1.5B‑Deep‑Instruct – a 1.5 B‑parameter hybrid model that combines a classic decoder‑only transformer stack with Mamba (state‑space) blocks.

Key architectural highlights:

  • Hybrid Transformer + Mamba: Transformer attention and Mamba (state‑space) modules are combined within the decoder blocks, giving strong sequence modeling while keeping compute low.
  • Efficient inference: The mixed architecture enables fast token generation on CPUs and NPUs, making the model well‑suited for edge devices.
  • Multilingual: Primarily English but trained on a multilingual corpus, so it handles many languages reasonably well.
  • Instruction‑tuned: Optimized for chat, tool‑calling, and structured JSON output (see the JSON sketch after this list).
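
Because the model targets structured JSON output, here is a minimal sketch that requests it through the ollama Python client's format parameter; the extraction task and keys in the prompt are a made‑up example:

    import json
    import ollama

    # format="json" constrains the reply to valid JSON; the keys asked
    # for in the prompt are illustrative, not a fixed schema.
    response = ollama.chat(
        model="sam860/falcon-h1:1.5b-deep-Q8_0",
        messages=[{
            "role": "user",
            "content": (
                'Extract the product and price as JSON with keys "product" '
                'and "price" from: "The mug costs 12.50 euros."'
            ),
        }],
        format="json",
        options={"temperature": 0.1},
    )
    print(json.loads(response["message"]["content"]))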

Ideal for:

  • On‑device assistants and chatbots
  • Retrieval‑augmented generation (RAG) pipelines (a prompt‑assembly sketch follows this list)
  • Structured data extraction / JSON generation
  • Lightweight code completion (FIM)
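
For the RAG case, the model only needs retrieved passages placed into its prompt; retrieval itself is whatever your pipeline provides. A minimal prompt‑assembly sketch, with the docs list standing in for real retriever output:

    import ollama

    # Stand-ins for passages returned by a retriever (vector store, BM25, ...).
    docs = [
        "Falcon-H1 pairs transformer attention with Mamba state-space blocks.",
        "The 1.5B-Deep variant targets on-device and edge inference.",
    ]
    question = "What architecture does Falcon-H1 use?"

    # Ground the answer in the retrieved context only.
    context = "\n".join(f"- {d}" for d in docs)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    response = ollama.chat(
        model="sam860/falcon-h1:1.5b-deep-Q8_0",
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.1},
    )
    print(response["message"]["content"])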

References

  • Falcon‑H1 release blogpost
  • Technical report (arXiv 2507.22448)
  • Model card on HuggingFace
  • Discord community