A 152M-parameter language model that runs on almost anything (laptops, old PCs, even a Raspberry Pi) yet does instruction-following chat, step-by-step reasoning, and web search for fresh facts.

tools thinking 152m
ollama run Alieno/ailo-152m-v2:q8_0

Details

034ee7932adc · 163MB · llama · 152M · Q8_0
{ "num_ctx": 512, "num_predict": 256, "repeat_penalty": 1.3, "stop": [ "<|en
{{- if .Tools }}{{ end }}{{- $u := "" }}{{- range .Messages }}{{- if eq .Role "user" }}{{- $u = .Con

Readme

AILO-152M-v2: Tiny LLM with Chat, Reasoning & Web Search ⚡

AILO (Artificial Intelligence Language Operator) is a compact, fast, from-scratch transformer. v2 turns the original base model into a real assistant: it answers questions, thinks before answering, and can use live web results to answer about things it was never trained on.

ollama run Alieno/ailo-152m-v2
🧠 Parameters: 151.9M
Speed: up to 384 tok/s (GPU), runs on CPU & edge
📦 Size: 97 MB (q4_k_m) – 305 MB (f16)
🌐 Web search: yes (context-following)
💭 Reasoning: yes (<think>)
🪶 Min RAM: ~300 MB

✨ Why AILO-152M-v2?

  • Runs anywhere: 97 MB quantized, ~300 MB RAM. Old laptops, mini-PCs, Raspberry Pi, phones.
  • Fast: among the fastest in its class (see benchmarks). Real-time chat even on modest hardware.
  • Web-aware: trained for context-following, so it answers from fresh search results instead of stale memory.
  • Distilled from a bigger model: answers learned from Gemma 3 4B (knowledge distillation), giving richer, better-structured replies than its size suggests.
  • Honest small model: strong at concise factual Q&A and conversation; pairs with tools for exact math.
  • Open & local: no cloud, full privacy, drop-in for Ollama.

Great for: edge/on-device AI, offline assistants, learning how LLMs work, fast prototyping, low-power servers, privacy-first chatbots.


🚀 Quick start

Ollama (recommended)

ollama run Alieno/ailo-152m-v2
>>> What is the capital of Italy?
The capital city of Italy is Rome.

Tags: :latest / :q8_0 (best quality, 156 MB) · :q4_k_m (smallest, 97 MB) · :f16 (full precision, 305 MB)
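
The tag sizes follow roughly from the parameter count and the bits stored per weight. Below is a back-of-the-envelope check, assuming 151.9M parameters, 16 bits per weight for f16, and 8.5 bits per weight for q8_0 (each block of 32 int8 weights carries one 16-bit scale). `estimate_size_mb` is a name made up for this sketch, and q4_k_m is left out because it mixes precisions across tensors.

```python
def estimate_size_mb(n_params: float, bits_per_weight: float) -> float:
    """Rough file size in MB (10^6 bytes), ignoring metadata and mixed-precision tensors."""
    return n_params * bits_per_weight / 8 / 1e6


N_PARAMS = 151.9e6  # parameter count from the model card

print(f"f16 : {estimate_size_mb(N_PARAMS, 16):.1f} MB")
print(f"q8_0: {estimate_size_mb(N_PARAMS, 8.5):.1f} MB")
```

The estimates come out at about 304 MB for f16 and about 161 MB for q8_0, close to the listed 305 MB and 156 MB; real files differ slightly because of metadata and tensors kept at other precisions.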

API

curl http://localhost:11434/api/chat -d '{
  "model": "Alieno/ailo-152m-v2",
  "messages": [{"role": "user", "content": "Explain what gravity is."}]
}'
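
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming an Ollama server on the default local port; `build_chat_payload` and `chat` are illustrative helper names, and `"stream": false` is added so the server returns one JSON object rather than a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint


def build_chat_payload(prompt: str, model: str = "Alieno/ailo-152m-v2") -> dict:
    """Build a non-streaming /api/chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object instead of a stream of chunks
    }


def chat(prompt: str) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    data = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the server running and the model pulled, `print(chat("Explain what gravity is."))` prints the reply.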

🏆 Benchmarks

Evaluated via Ollama /api/chat on factual QA, reasoning and coherence vs comparable and larger models:

| Model | Params | Factual | Reasoning | Coherence | Speed (tok/s) |
|---|---|---|---|---|---|
| AILO-152M-v2 | 152M | 78 | 1–2/5 | 100% 🥇 | 384 |
| SmolLM2 | 135M | 88 | 15 | 98% | 403 |
| Qwen2.5 | 500M | 88 | 3–4/5 | 96% | 213 |
| TinyLlama | 1.1B | 88 | 1–2/5 | 97% | 260 |

  • 🥇 Top coherence (100%, virtually no repetition) and among the fastest.
  • Factual score of 78, about 10 points behind the 88 scored by each comparison model.
  • Trails the bigger instruction-tuned models on multi-step reasoning, which is expected for a small, from-scratch model.

Measured on an NVIDIA RTX 5060 Ti. Reasoning has run-to-run variance on an 85-question micro-suite.
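
The README does not say how the speed column was computed. One way to reproduce a tok/s figure, sketched here as an assumption, is to divide the `eval_count` field of a non-streaming Ollama response by `eval_duration`, which Ollama reports in nanoseconds. `tokens_per_second` is an illustrative helper name and the sample values are made up.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response (eval_duration is in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


# Illustrative numbers only, not a measurement: 256 tokens generated in 0.8 s.
sample = {"eval_count": 256, "eval_duration": 800_000_000}
print(f"{tokens_per_second(sample):.0f} tok/s")
```

This counts generation only; prompt processing is reported separately by Ollama and is not included.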


🖥️ Hardware & performance

AILO-152M is tiny, so it runs well beyond high-end GPUs, including on old and low-power hardware. Approximate generation speed (q4_k_m, ~97 MB):

| Hardware | Type | Est. speed (tok/s) | Notes |
|---|---|---|---|
| RTX 5060 Ti / 4070+ | Modern GPU | 350–450 | ✅ measured: 384 (q8_0) |
| RTX 3060 / 2070 | Mid GPU | ~250–350 | smooth real-time |
| GTX 1660 / 1060 | Older GPU | ~150–220 | still real-time |
| GTX 1050 / MX150 | Old laptop GPU | ~90–140 | very usable |
| Ryzen 7 / Core i7 (recent) | Modern CPU | ~45–80 | no GPU needed |
| Core i5 (~2015) | Old CPU | ~18–30 | usable for chat |
| Raspberry Pi 5 | SBC / edge | ~10–16 | runs offline |
| Raspberry Pi 4 | Low-power SBC | ~5–9 | runs offline |
| Recent smartphone | Mobile | ~15–35 | via llama.cpp/Termux |

Estimates except the measured RTX 5060 Ti; real numbers vary with quantization, RAM bandwidth and build flags. The takeaway: AILO runs even where larger models can’t load at all.

Minimum requirements: ~300 MB RAM (q4_k_m), any x86-64 / ARM CPU. No GPU required.
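
To put the estimated speeds in practical terms, the sketch below converts them into wall-clock time for a full-length reply. It assumes a reply of 256 tokens (the `num_predict` value in the model's Ollama parameters), uses the estimated ranges from the table above, and ignores prompt-processing time.

```python
NUM_PREDICT = 256  # generation cap from the model's Ollama parameters

# (low, high) estimated tok/s, copied from the hardware table
speeds = {
    "Raspberry Pi 4": (5, 9),
    "Raspberry Pi 5": (10, 16),
    "Core i5 (~2015)": (18, 30),
}


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Time to generate `tokens` at a steady `tok_per_s`."""
    return tokens / tok_per_s


for name, (low, high) in speeds.items():
    fastest = seconds_for(NUM_PREDICT, high)
    slowest = seconds_for(NUM_PREDICT, low)
    print(f"{name}: {fastest:.0f}-{slowest:.0f} s for {NUM_PREDICT} tokens")
```

Shorter answers finish proportionally faster, so a one-sentence reply on a Raspberry Pi 4 takes a few seconds rather than the worst case.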


💬 Chat format

Trained on this template (tags are plain GPT-2 BPE sequences, with no vocab extension):

<|user|>
{question}
<|assistant|>
<think>{optional reasoning}</think>
{answer}<|end|>
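
Ollama applies this template automatically, but code that calls the model directly has to build and parse it by hand. The helpers below are a minimal sketch of that, not part of the AILO codebase; `format_prompt` and `parse_completion` are hypothetical names, and the `raw` string is hand-written for illustration rather than real model output.

```python
import re


def format_prompt(question: str) -> str:
    """Render a single user turn and open the assistant turn."""
    return f"<|user|>\n{question}\n<|assistant|>\n"


def parse_completion(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer)."""
    text = text.split("<|end|>", 1)[0]  # drop the end tag and anything after it
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer


raw = "<think>Italy's capital is Rome.</think>\nThe capital city of Italy is Rome.<|end|>"
print(repr(format_prompt("What is the capital of Italy?")))
print(parse_completion(raw))
```

Because the reasoning block is optional, `parse_completion` returns an empty reasoning string when no `<think>` tags are present.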

🌐 Web search (fresh facts)

AILO v2 is trained for context-following with override: give it search results and it answers from them even when they contradict its training-time knowledge, so it can use up-to-date facts. When no context is given, it falls back to its own (true) knowledge.

A ready pipeline is included (ailo_web.py): DuckDuckGo → instant-answer + semantic re-ranking (MiniLM) with language/relevance filters → short clean context (fits the 512-token window) → AILO answers.

python ailo_web.py "What is the tallest mountain in the world?"
# -> "Mount Everest, at 8,848 meters."

This is how a 152M model can answer about events it never saw in training.
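
The last pipeline step, fitting ranked snippets into the short context window, can be sketched as a greedy packer. This is not the code in ailo_web.py: the `Context:` / `Question:` layout, the `pack_context` name, the default budget, and the whitespace word count used as a crude stand-in for BPE tokens are all assumptions made for illustration.

```python
def pack_context(snippets, question, budget=200, count_tokens=lambda s: len(s.split())):
    """Greedily keep the highest-ranked snippets that fit the token budget.

    `snippets` is assumed to be sorted by relevance already. The budget should
    leave headroom in the 512-token window for the template and the answer.
    """
    kept, used = [], count_tokens(question)
    for snippet in snippets:
        cost = count_tokens(snippet)
        if used + cost > budget:
            continue  # skip snippets that would overflow; shorter ones may still fit
        kept.append(snippet)
        used += cost
    context = "\n".join(kept)
    return f"Context:\n{context}\n\nQuestion: {question}"


snippets = [
    "Mount Everest is 8,848 meters tall.",
    "K2 is the second-highest mountain on Earth.",
]
print(pack_context(snippets, "What is the tallest mountain in the world?"))
```

Passing a real tokenizer as `count_tokens` (for example one based on the GPT-2 encoding the model uses) would give a tighter fit than counting words.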


💭 Reasoning (thinking)

The model declares the thinking capability: set "think": true and the reasoning trace is returned in message.thinking, separate from the answer (shown in a dedicated box in the Ollama desktop app). Best on reasoning-style prompts; for exact math, pair with a calculator tool.
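
A request with thinking enabled, and the split of the reply into trace and answer, might look like the sketch below. It assumes an Ollama version that supports the `think` field; `build_thinking_payload` and `split_response` are illustrative names, and `fake` is a hand-written response in the shape `/api/chat` returns, not real model output.

```python
def build_thinking_payload(prompt: str, model: str = "Alieno/ailo-152m-v2") -> dict:
    """Request body for /api/chat with the reasoning trace enabled."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": True,    # ask for the trace in message.thinking
        "stream": False,  # single JSON response
    }


def split_response(response: dict) -> tuple[str, str]:
    """Return (thinking, answer); thinking is empty if the server sent none."""
    message = response["message"]
    return message.get("thinking", ""), message["content"]


fake = {
    "message": {
        "role": "assistant",
        "thinking": "Rome is the seat of the Italian government.",
        "content": "The capital city of Italy is Rome.",
    }
}
thinking, answer = split_response(fake)
print("thinking:", thinking)
print("answer:", answer)
```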


🐍 Python (Transformers)

import sys

import tiktoken
import torch
from huggingface_hub import hf_hub_download

# Download the config, the custom model code and the weights into ./ailo_v2
repo = "xxrickyxx/ailo-152m"
for f in ["config.json", "configuration_ailo.py", "modeling_ailo.py", "pytorch_model.bin"]:
    hf_hub_download(repo_id=repo, filename=f, local_dir="ailo_v2")

# Make the downloaded model code importable
sys.path.insert(0, "ailo_v2")
from modeling_ailo import AILOForCausalLM
from configuration_ailo import AILOConfig

model = AILOForCausalLM(AILOConfig())
model.load_state_dict(torch.load("ailo_v2/pytorch_model.bin", map_location="cpu"), strict=False)
model.eval()

# The chat tags are plain GPT-2 BPE text, so encode_ordinary is enough
tok = tiktoken.get_encoding("gpt2")
ids = torch.tensor([tok.encode_ordinary("<|user|>\nWhat is the capital of Italy?\n<|assistant|>\n")])
print(tok.decode(model.generate(ids, max_new_tokens=40, temperature=0.3)[0].tolist()))

📐 Model details

| Property | Value |
|---|---|
| Parameters | 151.9M |
| Architecture | Decoder-only Transformer (LayerNorm · RoPE · SwiGLU) |
| Layers / Hidden / Heads | 12 / 768 / 12 |
| Context length | 512 tokens |
| Vocabulary | 50,257 (GPT-2 BPE) |
| Base | AILO-152M (FineWeb-Edu, 182k steps) |
| Fine-tuning | SFT + distillation from Gemma 3 4B: instruction + reasoning (GSM8K) + context-following (SQuAD) + context-override + tool-use |
| Formats | GGUF (q4_k_m, q8_0, f16) + PyTorch |

⚠️ Limitations

  • 152M params: limited world knowledge and multi-step reasoning vs larger models.
  • 512-token context: best with short, focused prompts; not for long documents.
  • Web-search quality depends on search-result quality; best for well-defined factual questions.
  • For exact arithmetic, use the tool/agent layer (the calculator does the math).
  • Primarily English.
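
For the exact-arithmetic limitation, the calculator side of a tool layer can be very small. The sketch below is a hypothetical stand-in, not the project's own tool: `calculate` walks the parsed expression and accepts only numbers, parentheses and the operators `+ - * / **`, so it never calls `eval()` on model output.

```python
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression; raise ValueError on anything else."""

    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(ast.parse(expression, mode="eval").body)


print(calculate("17 * 23 + 4"))  # 395
```

In an agent loop the model would emit the expression and the tool result would be passed back as context; very large exponents should be capped before using this on untrusted input.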

📜 License

This project uses a dual-license model.

🆓 Non-Commercial License

Released under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0).

You are free to:

  • Use the model for research, education, and personal projects
  • Modify and fine-tune the model
  • Redistribute derivatives under the same license

You must:

  • Provide attribution
  • Keep the same license for derivative works
  • Not use the model for commercial purposes

💼 Commercial License

Commercial use of AILO-152M is not permitted under the free license. Commercial use includes:

  • Integration into paid products or services
  • Use in SaaS platforms, APIs, or enterprise systems
  • Any application that generates revenue directly or indirectly

For commercial licensing, a separate paid agreement (royalty or license fee) is required. Please contact the author.


📬 Contact

For research collaboration or commercial licensing inquiries, contact the project maintainer:

Riccardo Sparacino (LinkedIn)


📑 Citation

@misc{ailo152m_v2_2026,
  title  = {AILO-152M-v2: A Tiny Instruction-Tuned LLM with Reasoning and Web Search},
  author = {Sparacino, Riccardo},
  year   = {2026},
  note   = {Dual-licensed CC BY-NC-SA 4.0 / commercial}
}

🙏 Acknowledgments

Built with Ollama and llama.cpp. Fine-tuning data: Alpaca-cleaned, GSM8K, SQuAD. Knowledge-distillation teacher: Gemma 3 4B. Embeddings for web re-ranking: sentence-transformers MiniLM.


Keywords: small language model, tiny LLM, 152M, efficient LLM, edge AI, on-device LLM, CPU inference, Raspberry Pi LLM, Ollama model, GGUF, instruction-tuned, reasoning model, web search LLM, RAG, offline assistant, low-resource, fast inference.