A 152M-parameter language model that runs on almost anything (laptops, old PCs, even a Raspberry Pi) yet does instruction-following chat, step-by-step reasoning, and web search for fresh facts.

Details

Updated 1 month ago · 034ee7932adc · 163MB

model
arch	llama
parameters	152M
quantization	Q8_0
size	163MB

params (193B)
{ "num_ctx": 512, "num_predict": 256, "repeat_penalty": 1.3, "stop": [ "<|en

template (291B)
{{- if .Tools }}{{ end }}{{- $u := "" }}{{- range .Messages }}{{- if eq .Role "user" }}{{- $u = .Con

AILO-152M-v2: Tiny LLM with Chat, Reasoning & Web Search ⚡

AILO (Artificial Intelligence Language Operator) is a compact, fast, from-scratch transformer. v2 turns the original base model into a real assistant: it answers questions, thinks before answering, and can use live web results to answer about things it was never trained on.

ollama run Alieno/ailo-152m-v2


🧠 Parameters	151.9M
⚡ Speed	up to 384 tok/s (GPU), runs on CPU & edge
📦 Size	97 MB (q4_k_m) – 305 MB (f16)
🌐 Web search	yes (context-following)
💭 Reasoning	yes (`<think>`)
🪶 Min RAM	~300 MB

✨ Why AILO-152M-v2?

Runs anywhere: 97 MB quantized, ~300 MB RAM. Old laptops, mini-PCs, Raspberry Pi, phones.
Fast: among the fastest in its class (see benchmarks). Real-time chat even on modest hardware.
Web-aware: trained for context-following, so it answers from fresh search results instead of stale memory.
Distilled from a bigger model: answers learned from Gemma 3 4B (knowledge distillation), giving richer, better-structured replies than its size suggests.
Honest small model: strong at concise factual Q&A and conversation; pairs with tools for exact math.
Open & local: no cloud, full privacy, drop-in for Ollama.

Great for: edge/on-device AI, offline assistants, learning how LLMs work, fast prototyping, low-power servers, privacy-first chatbots.

🚀 Quick start

Ollama (recommended)

ollama run Alieno/ailo-152m-v2
>>> What is the capital of Italy?
The capital city of Italy is Rome.

Tags: :latest / :q8_0 (best quality, 156 MB) · :q4_k_m (smallest, 97 MB) · :f16 (full precision, 305 MB)

API

curl http://localhost:11434/api/chat -d '{
  "model": "Alieno/ailo-152m-v2",
  "messages": [{"role": "user", "content": "Explain what gravity is."}]
}'

🏆 Benchmarks

Evaluated via Ollama /api/chat on factual QA, reasoning and coherence vs comparable and larger models:

Model	Params	Factual	Reasoning	Coherence	Speed (tok/s)
AILO-152M-v2	152M	7/8	1–2/5	100% 🥇	384
SmolLM2	135M	8/8	1/5	98%	403
Qwen2.5	500M	8/8	3–4/5	96%	213
TinyLlama	1.1B	8/8	1–2/5	97%	260

🥇 Top coherence (100%, virtually no repetition) and among the fastest.
Competitive on factual accuracy with models its size and larger.
Trails only bigger instruction-tuned models on multi-step reasoning, as expected for a from-scratch model of this size.

Measured on an NVIDIA RTX 5060 Ti. Reasoning scores vary from run to run on this micro-suite (8 factual and 5 reasoning questions).

🖥️ Hardware & performance

AILO-152M is tiny, so it is not limited to high-end GPUs: it also runs on old and low-power hardware. Approximate generation speed (q4_k_m, ~97 MB):

Hardware	Type	Est. speed (tok/s)	Notes
RTX 5060 Ti / 4070+	Modern GPU	350–450	✅ measured: 384 (q8_0)
RTX 3060 / 2070	Mid GPU	~250–350	smooth real-time
GTX 1660 / 1060	Older GPU	~150–220	still real-time
GTX 1050 / MX150	Old laptop GPU	~90–140	very usable
Ryzen 7 / Core i7 (recent)	Modern CPU	~45–80	no GPU needed
Core i5 ~2015	Old CPU	~18–30	usable for chat
Raspberry Pi 5	SBC / edge	~10–16	runs offline
Raspberry Pi 4	Low-power SBC	~5–9	runs offline
Recent smartphone	Mobile	~15–35	via llama.cpp/Termux

All figures are estimates except the measured RTX 5060 Ti result; real numbers vary with quantization, RAM bandwidth and build flags. The takeaway: AILO runs even where larger models can’t load at all.

Minimum requirements: ~300 MB RAM (q4_k_m), any x86-64 / ARM CPU. No GPU required.
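To get a feel for what these speeds mean in practice, the sketch below converts tokens per second into the time needed for one full-length reply. It assumes the reply uses the whole `num_predict` budget of 256 tokens from the model's default params and ignores prompt-processing time, so treat the results as rough upper bounds. The function name `seconds_for_reply` is illustrative.

```python
# Default generation cap taken from the model's params ("num_predict": 256).
NUM_PREDICT = 256


def seconds_for_reply(tokens_per_second: float, num_tokens: int = NUM_PREDICT) -> float:
    """Time to generate num_tokens at a steady decoding speed (prompt time ignored)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Speed ranges (low, high) in tok/s copied from the table above.
ESTIMATES = {
    "Raspberry Pi 4": (5, 9),
    "Core i5 ~2015": (18, 30),
    "RTX 3060 / 2070": (250, 350),
}

for name, (low, high) in ESTIMATES.items():
    fastest = seconds_for_reply(high)
    slowest = seconds_for_reply(low)
    print(f"{name}: {fastest:.1f} to {slowest:.1f} s for a {NUM_PREDICT}-token reply")
```

Short answers finish much sooner, since generation stops at the `<|end|>` tag rather than at the token cap.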

💬 Chat format

Trained on this template (the tags are plain GPT-2 BPE sequences, with no vocabulary extension):

<|user|>
{question}
<|assistant|>
<think>{optional reasoning}</think>
{answer}<|end|>
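A minimal sketch of working with this template in Python, for readers who drive the raw weights instead of Ollama. The helper names (`build_prompt`, `build_example`, `parse_completion`) are hypothetical, and it assumes the `<think>` line is simply omitted when there is no reasoning, which the template above does not spell out.

```python
import re
from typing import Optional, Tuple

END_TAG = "<|end|>"
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def build_prompt(question: str) -> str:
    """Render the user turn and open the assistant turn for generation."""
    return f"<|user|>\n{question}\n<|assistant|>\n"


def build_example(question: str, answer: str, reasoning: Optional[str] = None) -> str:
    """Render a complete turn; the <think> line is left out when reasoning is empty (assumption)."""
    think = f"<think>{reasoning}</think>\n" if reasoning else ""
    return f"{build_prompt(question)}{think}{answer}{END_TAG}"


def parse_completion(text: str) -> Tuple[Optional[str], str]:
    """Split generated text into (reasoning, answer), discarding anything after <|end|>."""
    text = text.split(END_TAG, 1)[0]
    match = _THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    answer = text[:match.start()] + text[match.end():]
    return match.group(1).strip(), answer.strip()
```

For example, `parse_completion("<think>It is Rome.</think>\nRome.<|end|>")` returns the reasoning and the answer as two separate strings.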

🌐 Web search (fresh facts)

AILO v2 is trained for context-following with override: give it search results and it answers from them even when they contradict its training-time knowledge, so it can use up-to-date facts. When no context is given, it falls back to its own (true) knowledge.

A ready pipeline is included (ailo_web.py): DuckDuckGo → instant-answer + semantic re-ranking (MiniLM) with language/relevance filters → short clean context (fits the 512-token window) → AILO answers.

python ailo_web.py "What is the tallest mountain in the world?"
# -> "Mount Everest, at 8,848 meters."

This is how a 152M model can answer about events it never saw in training.
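The rank-then-trim step can be sketched as follows. This is not the code in ailo_web.py: it swaps the MiniLM embeddings for a simple word-overlap (Jaccard) score so it runs without extra dependencies, counts whitespace-separated words as a rough stand-in for tokens, and uses a `Context:` / `Question:` layout that is an assumption rather than the documented prompt format.

```python
import re
from typing import List, Set


def _words(text: str) -> Set[str]:
    """Lowercased alphanumeric words, used for the overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_snippets(question: str, snippets: List[str]) -> List[str]:
    """Order snippets by Jaccard word overlap with the question (stand-in for MiniLM re-ranking)."""
    q = _words(question)

    def score(snippet: str) -> float:
        w = _words(snippet)
        union = q | w
        return len(q & w) / len(union) if union else 0.0

    return sorted(snippets, key=score, reverse=True)


def build_context_prompt(question: str, snippets: List[str], max_context_words: int = 150) -> str:
    """Keep the best-ranked snippets that fit the word budget, then append the question."""
    picked, used = [], 0
    for snippet in rank_snippets(question, snippets):
        n = len(snippet.split())
        if used + n > max_context_words:
            continue  # too long for the remaining budget, try the next one
        picked.append(snippet)
        used += n
    context = "\n".join(picked)
    # Hypothetical layout; the real pipeline may format the context differently.
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The resulting string would be sent as the user message; the word budget should stay well below 512 tokens so the question and the answer still fit in the window.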

💭 Reasoning (thinking)

The model declares the thinking capability: set "think": true and the reasoning trace is returned in message.thinking, separate from the answer (shown in a dedicated box in the Ollama desktop app). Best on reasoning-style prompts; for exact math, pair with a calculator tool.
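A minimal sketch of requesting the reasoning trace over the HTTP API, using only the Python standard library. It assumes a local Ollama server on the default port and sets `"stream": false` so the reply arrives as a single JSON object; the helper names are illustrative.

```python
import json
import urllib.request
from typing import Tuple

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "Alieno/ailo-152m-v2"


def build_chat_payload(prompt: str, think: bool = True) -> dict:
    """Request body for /api/chat with thinking enabled and streaming turned off."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "think": think,
        "stream": False,
    }


def split_response(response: dict) -> Tuple[str, str]:
    """Return (thinking, answer) from a non-streamed /api/chat response."""
    message = response.get("message", {})
    return message.get("thinking", ""), message.get("content", "")


def chat(prompt: str, think: bool = True) -> Tuple[str, str]:
    """Send one user message to the local server and split the reply."""
    data = json.dumps(build_chat_payload(prompt, think)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return split_response(json.loads(resp.read()))
```

With the server running, `thinking, answer = chat("If I have 3 apples and eat one, how many are left?")` gives the trace and the final answer separately.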

🐍 Python (Transformers)

import sys

import tiktoken
import torch
from huggingface_hub import hf_hub_download

# Download the config, the custom model code and the weights into ./ailo_v2
repo = "xxrickyxx/ailo-152m"
for f in ["config.json", "configuration_ailo.py", "modeling_ailo.py", "pytorch_model.bin"]:
    hf_hub_download(repo_id=repo, filename=f, local_dir="ailo_v2")

# The model classes live in the downloaded files, so make them importable
sys.path.insert(0, "ailo_v2")
from modeling_ailo import AILOForCausalLM
from configuration_ailo import AILOConfig

model = AILOForCausalLM(AILOConfig())
model.load_state_dict(torch.load("ailo_v2/pytorch_model.bin", map_location="cpu"), strict=False)
model.eval()

# The chat tags are plain GPT-2 BPE text, so the prompt is encoded as ordinary text
tok = tiktoken.get_encoding("gpt2")
ids = torch.tensor([tok.encode_ordinary("<|user|>\nWhat is the capital of Italy?\n<|assistant|>\n")])
with torch.no_grad():  # inference only, no gradients needed
    out = model.generate(ids, max_new_tokens=40, temperature=0.3)
print(tok.decode(out[0].tolist()))

📐 Model details

Property	Value
Parameters	151.9M
Architecture	Decoder-only Transformer (LayerNorm · RoPE · SwiGLU)
Layers / Hidden / Heads	12 / 768 / 12
Context length	512 tokens
Vocabulary	50,257 (GPT-2 BPE)
Base	AILO-152M (FineWeb-Edu, 182k steps)
Fine-tuning	SFT + distillation from Gemma 3 4B: instruction + reasoning (GSM8K) + context-following (SQuAD) + context-override + tool-use
Formats	GGUF (q4_k_m, q8_0, f16) + PyTorch
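The file sizes of the formats follow roughly from the parameter count times the bits stored per weight. The sketch below works this out for f16 (16 bits per weight) and q8_0 (8.5 bits per weight, since each block of 32 8-bit weights carries one 16-bit scale). It is a back-of-the-envelope estimate: real GGUF files also hold metadata and the tokenizer, so actual sizes differ somewhat, and q4_k_m is left out because it mixes several quantization types.

```python
PARAMS = 151.9e6  # parameter count from the table above

# Bits stored per weight for each format.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,  # 32 int8 weights + one fp16 scale per block = 34 bytes / 32 weights
}


def estimated_size_mb(params: float, bits_per_weight: float) -> float:
    """Weights-only size in megabytes (1 MB = 10**6 bytes)."""
    return params * bits_per_weight / 8 / 1e6


for fmt, bpw in BITS_PER_WEIGHT.items():
    print(f"{fmt}: ~{estimated_size_mb(PARAMS, bpw):.1f} MB")
# f16: ~303.8 MB
# q8_0: ~161.4 MB
```

Both estimates land in the same ballpark as the sizes listed on this page for the f16 and q8_0 builds.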

⚠️ Limitations

152M params: limited world knowledge and multi-step reasoning vs larger models.
512-token context: best with short, focused prompts; not for long documents.
Web-search quality depends on search-result quality; best for well-defined factual questions.
For exact arithmetic, use the tool/agent layer (the calculator does the math).
Primarily English.

📜 License

This project uses a dual-license model.

🆓 Non-Commercial License

Released under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0).

You are free to:
- Use the model for research, education, and personal projects
- Modify and fine-tune the model
- Redistribute derivatives under the same license

You must:
- Provide attribution
- Keep the same license for derivative works
- Not use the model for commercial purposes

💼 Commercial License

Commercial use of AILO-152M is not permitted under the free license. Commercial use includes:
- Integration into paid products or services
- Use in SaaS platforms, APIs, or enterprise systems
- Any application that generates revenue directly or indirectly

For commercial licensing, a separate paid agreement (royalty or license fee) is required. Please contact the author.

📬 Contact

For research collaboration or commercial licensing inquiries, contact the project maintainer:

Riccardo Sparacino (LinkedIn)

📑 Citation

@misc{ailo152m_v2_2026,
  title  = {AILO-152M-v2: A Tiny Instruction-Tuned LLM with Reasoning and Web Search},
  author = {Sparacino, Riccardo},
  year   = {2026},
  note   = {Dual-licensed CC BY-NC-SA 4.0 / commercial}
}

🙏 Acknowledgments

Built with Ollama and llama.cpp. Fine-tuning data: Alpaca-cleaned, GSM8K, SQuAD. Knowledge-distillation teacher: Gemma 3 4B. Embeddings for web re-ranking: sentence-transformers MiniLM.

Keywords: small language model, tiny LLM, 152M, efficient LLM, edge AI, on-device LLM, CPU inference, Raspberry Pi LLM, Ollama model, GGUF, instruction-tuned, reasoning model, web search LLM, RAG, offline assistant, low-resource, fast inference.