40 Downloads Updated 4 days ago
ollama run Alieno/ailo-152m-v2:152m
034ee7932adc · 163MB ·
A 152M-parameter language model that runs on almost anything (laptops, old PCs, even a Raspberry Pi), yet handles instruction-following chat, step-by-step reasoning, and web search for fresh facts.
AILO (Artificial Intelligence Language Operator) is a compact, fast transformer trained from scratch. v2 turns the original base model into a real assistant: it answers questions, thinks before answering, and can use live web results to answer questions about things it was never trained on.
ollama run Alieno/ailo-152m-v2
| Feature | Value |
|---|---|
| 🧠 Parameters | 151.9M |
| ⚡ Speed | up to 384 tok/s (GPU), runs on CPU & edge |
| 📦 Size | 97 MB (q4_k_m) – 305 MB (f16) |
| 🌐 Web search | yes (context-following) |
| 💭 Reasoning | yes (`<think>`) |
| 🪶 Min RAM | ~300 MB |
Great for: edge/on-device AI, offline assistants, learning how LLMs work, fast prototyping, low-power servers, privacy-first chatbots.
ollama run Alieno/ailo-152m-v2
>>> What is the capital of Italy?
The capital city of Italy is Rome.
Tags: :latest / :q8_0 (best quality, 156 MB) · :q4_k_m (smallest, 97 MB) · :f16 (full precision, 305 MB)
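As a rough sanity check on these sizes, weight storage can be estimated from the parameter count and the number of bits stored per weight. This is only a sketch: it ignores GGUF metadata, the tokenizer, and any tensors kept at a different precision, and it leaves out q4_k_m because that format mixes several block types. The helper name `estimate_weights_mb` is made up for illustration.

```python
def estimate_weights_mb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal megabytes (1 MB = 10**6 bytes)."""
    return n_params * bits_per_weight / 8 / 1e6


N_PARAMS = 151.9e6  # parameter count from the model card

# f16 stores 16 bits per weight.
f16_mb = estimate_weights_mb(N_PARAMS, 16)

# q8_0 stores blocks of 32 int8 weights plus one 16-bit scale:
# 34 bytes per 32 weights = 8.5 bits per weight.
q8_0_mb = estimate_weights_mb(N_PARAMS, 8.5)

print(f"f16  ~ {f16_mb:.1f} MB")   # f16  ~ 303.8 MB
print(f"q8_0 ~ {q8_0_mb:.1f} MB")  # q8_0 ~ 161.4 MB
```

Both estimates land within a few percent of the listed tag sizes, which is about as close as a back-of-the-envelope figure should be expected to get.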
curl http://localhost:11434/api/chat -d '{
"model": "Alieno/ailo-152m-v2",
"messages": [{"role": "user", "content": "Explain what gravity is."}]
}'
Evaluated via Ollama /api/chat on factual QA, reasoning and coherence vs comparable and larger models:
| Model | Params | Factual | Reasoning | Coherence | Speed (tok/s) |
|---|---|---|---|---|---|
| AILO-152M-v2 | 152M | 7/8 | 1–2/5 | 100% | 384 |
| SmolLM2 | 135M | 8/8 | 1/5 | 98% | 403 |
| Qwen2.5 | 500M | 8/8 | 3–4/5 | 96% | 213 |
| TinyLlama | 1.1B | 8/8 | 1–2/5 | 97% | 260 |
Measured on an NVIDIA RTX 5060 Ti. Reasoning scores vary from run to run because the micro-suite is small (8 factual and 5 reasoning questions).
AILO-152M is tiny, so it runs well beyond high-end GPUs, including on old and low-power hardware. Approximate generation speed (q4_k_m, ~97 MB):
| Hardware | Type | Est. speed (tok/s) | Notes |
|---|---|---|---|
| RTX 5060 Ti / 4070+ | Modern GPU | 350–450 | ✅ measured: 384 (q8_0) |
| RTX 3060 / 2070 | Mid GPU | ~250–350 | smooth real-time |
| GTX 1660 / 1060 | Older GPU | ~150–220 | still real-time |
| GTX 1050 / MX150 | Old laptop GPU | ~90–140 | very usable |
| Ryzen 7 / Core i7 (recent) | Modern CPU | ~45–80 | no GPU needed |
| Core i5 ~2015 | Old CPU | ~18–30 | usable for chat |
| Raspberry Pi 5 | SBC / edge | ~10–16 | runs offline |
| Raspberry Pi 4 | Low-power SBC | ~5–9 | runs offline |
| Recent smartphone | Mobile | ~15–35 | via llama.cpp/Termux |
Estimates except the measured RTX 5060 Ti; real numbers vary with quantization, RAM bandwidth and build flags. The takeaway: AILO runs even where larger models can’t load at all.
Minimum requirements: ~300 MB RAM (q4_k_m), any x86-64 / ARM CPU. No GPU required.
Trained on this template (tags are plain GPT-2 BPE sequences, with no vocabulary extension):
<|user|>
{question}
<|assistant|>
<think>{optional reasoning}</think>
{answer}<|end|>
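If you call the model outside Ollama, the template above can be assembled and parsed with two small helpers. This is a minimal sketch: `build_prompt` and `split_completion` are hypothetical names, not part of the AILO code, and the parsing assumes the model emits at most one `<think>` block before the answer.

```python
import re


def build_prompt(question: str) -> str:
    """Format a single-turn prompt that follows the documented template."""
    return f"<|user|>\n{question}\n<|assistant|>\n"


def split_completion(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model output."""
    # Drop everything from the end-of-turn tag onward.
    text = text.split("<|end|>", 1)[0]
    # The reasoning block is optional, so fall back to an empty string.
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = text[match.end():] if match else text
    return reasoning, answer.strip()
```

For example, `split_completion("<think>2+2=4</think>\nThe answer is 4.<|end|>")` returns `("2+2=4", "The answer is 4.")`.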
AILO v2 is trained for context-following with override: give it search results and it answers from them even when they contradict its training-time knowledge, so it can use up-to-date facts. When no context is given, it falls back to its own (true) knowledge.
A ready pipeline is included (ailo_web.py): DuckDuckGo → instant-answer + semantic re-ranking (MiniLM) with language/relevance filters → short clean context (fits the 512-token window) → AILO answers.
python ailo_web.py "What is the tallest mountain in the world?"
# -> "Mount Everest, at 8,848 meters."
This is how a 152M model can answer questions about events it never saw in training.
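The "short clean context" step matters because the question, the context, and the generated answer all share the 512-token window. The sketch below shows one way to pack ranked snippets into a token budget. It is not taken from ailo_web.py: `pack_context`, `approx_tokens`, and the word-count heuristic are assumptions made for illustration.

```python
def approx_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer (assumption): about 4 tokens per 3 words."""
    return (len(text.split()) * 4 + 2) // 3


def pack_context(snippets, budget_tokens, count_tokens=approx_tokens):
    """Keep ranked snippets, best first, until the token budget is used up."""
    kept, used = [], 0
    for snippet in snippets:
        cost = count_tokens(snippet)
        if used + cost > budget_tokens:
            break  # the next snippet would overflow the window
        kept.append(snippet)
        used += cost
    return "\n".join(kept)


# Hypothetical budget: reserve room for the question and the generated answer.
question_tokens, answer_tokens = 30, 40
budget = 512 - question_tokens - answer_tokens
context = pack_context(["Mount Everest is 8,848 meters tall."], budget)
```

For exact counts, pass a real tokenizer as `count_tokens`, for example `lambda s: len(tok.encode_ordinary(s))` with the GPT-2 encoding from tiktoken.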
The model declares the thinking capability: set "think": true and the reasoning trace is returned in message.thinking, separate from the answer (shown in a dedicated box in the Ollama desktop app). Best on reasoning-style prompts; for exact math, pair with a calculator tool.
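A minimal Python sketch of such a request, using only the standard library, might look like the following. It assumes a local Ollama server on the default port and a non-streaming reply; the helper names (`build_chat_request`, `extract_reply`, `chat`) are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt: str, think: bool = True) -> dict:
    """Build the JSON body for a single-turn, non-streaming chat call."""
    return {
        "model": "Alieno/ailo-152m-v2",
        "messages": [{"role": "user", "content": prompt}],
        "think": think,
        "stream": False,
    }


def extract_reply(response: dict) -> tuple[str, str]:
    """Return (thinking, answer); thinking is empty if no trace was returned."""
    message = response.get("message", {})
    return message.get("thinking", ""), message.get("content", "")


def chat(prompt: str, think: bool = True) -> tuple[str, str]:
    """Send the request to the local Ollama server and parse the reply."""
    data = json.dumps(build_chat_request(prompt, think)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return extract_reply(json.load(reply))


# Usage (requires a running Ollama server with the model pulled):
# thinking, answer = chat("If I have 3 apples and eat one, how many are left?")
```

Keeping request building and reply parsing in separate functions makes both easy to check without a server running.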
import sys

import tiktoken
import torch
from huggingface_hub import hf_hub_download

# Download the config, the custom model code, and the weights into ./ailo_v2.
repo = "xxrickyxx/ailo-152m"
for f in ["config.json", "configuration_ailo.py", "modeling_ailo.py", "pytorch_model.bin"]:
    hf_hub_download(repo_id=repo, filename=f, local_dir="ailo_v2")

# Make the downloaded modeling files importable.
sys.path.insert(0, "ailo_v2")
from modeling_ailo import AILOForCausalLM
from configuration_ailo import AILOConfig

# Build the model and load the checkpoint on CPU.
model = AILOForCausalLM(AILOConfig())
model.load_state_dict(torch.load("ailo_v2/pytorch_model.bin", map_location="cpu"), strict=False)
model.eval()

# Encode a prompt in the chat template with the GPT-2 BPE tokenizer, then generate.
tok = tiktoken.get_encoding("gpt2")
ids = torch.tensor([tok.encode_ordinary("<|user|>\nWhat is the capital of Italy?\n<|assistant|>\n")])
print(tok.decode(model.generate(ids, max_new_tokens=40, temperature=0.3)[0].tolist()))
| Property | Value |
|---|---|
| Parameters | 151.9M |
| Architecture | Decoder-only Transformer (LayerNorm · RoPE · SwiGLU) |
| Layers / Hidden / Heads | 12 / 768 / 12 |
| Context length | 512 tokens |
| Vocabulary | 50,257 (GPT-2 BPE) |
| Base | AILO-152M (FineWeb-Edu, 182k steps) |
| Fine-tuning | SFT + distillation from Gemma 3 4B: instruction + reasoning (GSM8K) + context-following (SQuAD) + context-override + tool-use |
| Formats | GGUF (q4_k_m, q8_0, f16) + PyTorch |
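The 151.9M figure can be roughly reproduced from the table. The sketch below makes three assumptions that the card does not state: the SwiGLU inner width is 4 × hidden (3072), the input embedding is tied to the output head, and the linear layers have no biases. RoPE adds no learned parameters.

```python
VOCAB, HIDDEN, LAYERS = 50_257, 768, 12  # values from the table above
FFN = 4 * HIDDEN  # assumption: SwiGLU inner width of 3072 (not stated in the card)

embedding = VOCAB * HIDDEN       # assumption: tied with the output head
attention = 4 * HIDDEN * HIDDEN  # Q, K, V and output projections, no biases
swiglu = 3 * HIDDEN * FFN        # gate, up and down projections
norms = 2 * (2 * HIDDEN)         # two LayerNorms per block, weight + bias each
final_norm = 2 * HIDDEN          # LayerNorm before the output head

total = embedding + LAYERS * (attention + swiglu + norms) + final_norm
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
# 151,881,984 parameters (~151.9M)
```

The match with the stated count suggests these assumptions are plausible, but the authoritative values are in config.json and modeling_ailo.py.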
This project uses a dual-license model.
Released under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0).
You are free to:
- Use the model for research, education, and personal projects
- Modify and fine-tune the model
- Redistribute derivatives under the same license

You must:
- Provide attribution
- Keep the same license for derivative works
- Not use the model for commercial purposes

Commercial use of AILO-152M is not permitted under the free license. Commercial use includes:
- Integration into paid products or services
- Use in SaaS platforms, APIs, or enterprise systems
- Any application that generates revenue directly or indirectly
For commercial licensing, a separate paid agreement (royalty or license fee) is required. Please contact the author.
For research collaboration or commercial licensing inquiries, contact the project maintainer:
Riccardo Sparacino (LinkedIn)
@misc{ailo152m_v2_2026,
title = {AILO-152M-v2: A Tiny Instruction-Tuned LLM with Reasoning and Web Search},
author = {Sparacino, Riccardo},
year = {2026},
note = {Dual-licensed CC BY-NC-SA 4.0 / commercial}
}
Built with Ollama and llama.cpp. Fine-tuning data: Alpaca-cleaned, GSM8K, SQuAD. Knowledge-distillation teacher: Gemma 3 4B. Embeddings for web re-ranking: sentence-transformers MiniLM.
Keywords: small language model, tiny LLM, 152M, efficient LLM, edge AI, on-device LLM, CPU inference, Raspberry Pi LLM, Ollama model, GGUF, instruction-tuned, reasoning model, web search LLM, RAG, offline assistant, low-resource, fast inference.