A Brazilian Portuguese causal language model, trained from scratch. Open weights, Apache 2.0.
Maracatu-80M is an 87.8M-parameter decoder-only transformer trained from scratch on Brazilian Portuguese text. It is the second public checkpoint of the Maracatu AI project, an open effort to build Portuguese-language LLMs with full transparency over architecture, data, and training.
This is a base model (completion only). It continues text — it is not a chat assistant and does not follow instructions.
# pull the default quantization (Q4_K_M, ~54 MB)
ollama pull whereisanzi/maracatu-80m
# run with a prompt
ollama run whereisanzi/maracatu-80m "O Brasil é"
# inspect model metadata
ollama show whereisanzi/maracatu-80m
The model outputs lowercase text only — this is expected; the tokenizer normalizes all input to lowercase.
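
To preview what the lowercase normalization does to a prompt before it reaches the model, the sketch below approximates it with the Python standard library. It assumes that NFKC normalization followed by Unicode case folding is close enough to SentencePiece's `nmt_nfkc_cf` rule for ordinary Portuguese text; the real rule also handles whitespace and control characters, which this sketch ignores. The function name `approx_nfkc_casefold` is hypothetical.

```python
import unicodedata


def approx_nfkc_casefold(text: str) -> str:
    """Approximate the tokenizer's normalization: NFKC, then case folding.

    Assumption: this mirrors nmt_nfkc_cf only for ordinary text; it is not
    the SentencePiece implementation.
    """
    return unicodedata.normalize("NFKC", text).casefold()


print(approx_nfkc_casefold("O Brasil é"))  # o brasil é
```

Because uppercase and lowercase prompts collapse to the same string under this rule, "O Brasil é" and "o brasil é" should be indistinguishable to the model.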
ollama pull whereisanzi/maracatu-80m:q5_k_m
ollama pull whereisanzi/maracatu-80m:q8_0
ollama run whereisanzi/maracatu-80m:q8_0 "A literatura brasileira é"
| Tag | Method | File size | Recommended for |
|---|---|---|---|
| latest / q4_k_m | Q4_K_M | ~54 MB | General use; best size/quality tradeoff |
| q5_k_m | Q5_K_M | ~61 MB | Slightly higher fidelity, still fast |
| q8_0 | Q8_0 | ~90 MB | Evaluation, debugging, max precision |
| fp16 | f16 | ~168 MB | Full precision reference |
All variants run comfortably on CPU. At 87.8M parameters the difference between Q4_K_M and Q8_0 is perceptible but minor for most prompts.
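
Any of the tags above can also be queried programmatically. The sketch below uses Ollama's local REST endpoint `/api/generate` with only the Python standard library. It assumes a server on the default address `http://localhost:11434` and that the tag has already been pulled. The helper names `build_request` and `generate` are hypothetical, and the sampling options are illustrative values, not settings required by the model.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, tag: str = "latest") -> dict:
    """Build the JSON payload for a single non-streaming completion."""
    return {
        "model": f"whereisanzi/maracatu-80m:{tag}",
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a stream
        "options": {
            "temperature": 0.8,
            "top_k": 50,
            "repeat_penalty": 1.1,
            "seed": 123,
            "num_predict": 60,  # cap on generated tokens
        },
    }


def generate(prompt: str, tag: str = "latest") -> str:
    """Send the request and return the generated continuation."""
    data = json.dumps(build_request(prompt, tag)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("A literatura brasileira é", tag="q8_0")` should return a lowercase continuation. Since this is a base model, the prompt is simply continued rather than answered.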
Llama-style decoder-only transformer (RMSNorm, RoPE, SwiGLU, GQA, no bias in linear layers, weight tying). Compatible bit-for-bit with transformers.LlamaForCausalLM (max_abs_diff=0.0 validated against native forward pass).
| Hyperparameter | Value |
|---|---|
| Total parameters | 87.80M |
| Non-embedding parameters | 75.52M |
| Layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| KV heads | 4 (GQA, 3:1 ratio) |
| Intermediate size (SwiGLU) | 2048 |
| Context length | 1024 tokens |
| Vocabulary | 16,000 (SentencePiece BPE, lowercase, split_digits) |
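
The parameter totals in the table can be reproduced from the other hyperparameters. The sketch below counts weights for a Llama-style block under the stated design (no biases, tied input/output embeddings, two RMSNorm layers per block plus a final norm). The function name `llama_param_count` is hypothetical and the layout is an assumption based on the standard Llama architecture, not taken from the project's source code.

```python
def llama_param_count(vocab, hidden, layers, heads, kv_heads, intermediate, tied=True):
    """Return (total, non_embedding) parameter counts for a bias-free Llama-style model."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim                      # GQA: K and V are narrower than Q
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj, o_proj + k_proj, v_proj
    mlp = 3 * hidden * intermediate                   # SwiGLU: gate, up, down
    norms = 2 * hidden                                # two RMSNorm weight vectors per block
    non_embedding = layers * (attn + mlp + norms) + hidden  # + final RMSNorm
    embedding = vocab * hidden
    total = non_embedding + embedding * (1 if tied else 2)
    return total, non_embedding


total, non_emb = llama_param_count(
    vocab=16_000, hidden=768, layers=12, heads=12, kv_heads=4, intermediate=2048
)
print(f"{total / 1e6:.2f}M total, {non_emb / 1e6:.2f}M non-embedding")
# 87.80M total, 75.52M non-embedding
```

The difference between the two figures, 12.29M, is the single shared 16,000 × 768 embedding matrix.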
Corpus v2 — 1.60B tokens. SHA-256: a1000e873bfcae0d2229ecc9b329f0befe8ad73913e79e58f14a1f3a48ef7e58
| Source | License | Tokens |
|---|---|---|
| Wikipedia PT (wikimedia/wikipedia, 20231101.pt) | CC BY-SA 3.0 | ~550M |
| Project Gutenberg PT (24 curated public-domain works) | Public Domain | ~150M |
| CulturaX-PT filtered (1.49M docs) | ODC-BY 1.0 | ~900M |
Filtering: MinHash LSH dedup (Jaccard 0.85), PII regex (CPF, email, CEP, phone BR), language heuristic, byte-level dedup. No raw Common Crawl. No CC BY-NC sources.
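
As a rough illustration of two of the filtering steps, the sketch below estimates Jaccard similarity from MinHash signatures and masks a few Brazilian PII patterns. It is not the project's pipeline: the word-trigram shingling, the 128 hash functions, the regexes, and every function name are assumptions made for the example, and a real LSH setup would bucket signatures into bands rather than compare pairs directly.

```python
import hashlib
import re


def shingles(text, n=3):
    """Set of word n-grams used as the document's features."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items, num_perm=128):
    """One minimum per salted hash function; the salt stands in for a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical PII patterns: formatted CPF, email address, formatted CEP.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"), "[CPF]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b\d{5}-\d{3}\b"), "[CEP]"),
]


def scrub_pii(text):
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


doc_a = shingles("o maracatu é uma manifestação cultural de pernambuco com raízes africanas")
doc_b = shingles("o maracatu é uma manifestação cultural de pernambuco com raízes indígenas")
estimate = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"exact={exact_jaccard(doc_a, doc_b):.2f} estimate={estimate:.2f}")
print("near-duplicate" if estimate >= 0.85 else "kept")  # 0.85 threshold from the text
print(scrub_pii("cpf 123.456.789-09, email joao@example.com, cep 01310-100"))
```

The estimate converges to the exact Jaccard value as the number of hash functions grows, which is what makes a fixed threshold such as 0.85 usable on signatures alone.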
| Item | Value |
|---|---|
| Framework | PyTorch |
| Hardware | NVIDIA RTX 3060 12GB (single GPU, self-hosted) |
| Total iterations | 200,000 |
| Tokens seen | ~1.64B (Chinchilla ~21.7 tok/param) |
| Batch size | 8 |
| Precision | bf16 autocast (fp32 weights + optimizer) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1) |
| Learning rate | 2.5e-4 → 2.5e-5 (4k warmup + cosine decay) |
| Throughput | ~20,200 tok/s (stable throughout) |
| Total training time | 22h 31min continuous |
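
The figures in this table are mutually consistent, which the sketch below checks along with the shape of the learning-rate schedule. It assumes that the batch size counts sequences of 1,024 tokens with no gradient accumulation, that the Chinchilla ratio is taken over non-embedding parameters, and that warmup is linear; none of these is stated explicitly above. The function name `lr_at` is hypothetical.

```python
import math

ITERATIONS = 200_000
BATCH_SIZE = 8            # assumed: sequences per step
CONTEXT = 1024            # tokens per sequence
NON_EMBEDDING = 75.52e6   # parameters
THROUGHPUT = 20_200       # tokens per second

tokens_seen = ITERATIONS * BATCH_SIZE * CONTEXT
tokens_per_param = tokens_seen / NON_EMBEDDING
hours = tokens_seen / THROUGHPUT / 3600


def lr_at(step, peak=2.5e-4, floor=2.5e-5, warmup=4_000, total=ITERATIONS):
    """Linear warmup (assumed) to `peak`, then cosine decay down to `floor`."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))


print(f"tokens seen: {tokens_seen / 1e9:.2f}B")
print(f"tokens per non-embedding parameter: {tokens_per_param:.1f}")
print(f"estimated wall-clock time: {hours:.1f} h")
for step in (0, 4_000, 100_000, 200_000):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the token count comes out at about 1.64B and the run time at roughly 22.5 hours, in line with the table.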
Validation on a 3.27M-token PT-BR holdout (last chronological segment, not seen during training):
| Metric | Value | Step |
|---|---|---|
| Best validation loss | 3.0163 | ~190,000 |
| Final validation loss | 3.0604 | 200,000 |
| Final validation perplexity | 21.34 | 200,000 |
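
Perplexity is the exponential of the mean per-token cross-entropy loss, so the last two rows describe the same measurement. The short check below assumes the loss is reported in nats (natural logarithm), which the numbers in the table bear out.

```python
import math

final_loss = 3.0604
best_loss = 3.0163

final_ppl = math.exp(final_loss)
best_ppl = math.exp(best_loss)

print(f"final perplexity: {final_ppl:.2f}")
print(f"perplexity at best checkpoint: {best_ppl:.2f}")
```

The same conversion applied to the best validation loss gives a perplexity of about 20.4 at step ~190,000.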
Zero-shot downstream benchmarks via lm-evaluation-harness 0.4.11:
| Task | Score | Random baseline |
|---|---|---|
| ENEM Challenge (1432 questions) | 20.27% | 20% (5-MCQ) |
| ASSIN Entailment | 29.08% | ~33% (3-class) |
| ASSIN Paraphrase | 52.42% | 50% (binary) |
Honest reading: ENEM is at chance, ASSIN Entailment slightly below, ASSIN Paraphrase modestly above. Pretrain improvements show up in generation fluency (lower perplexity), not in MCQ accuracy. For reference, Tucano-160M reports validation perplexity around 22 on Portuguese text; Maracatu-80M reaches 21.34 with half the parameters, though the comparison is only indicative (different harness versions, vocabulary, and holdout splits).
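
The "at chance" reading for ENEM can be made concrete with a quick significance check. The sketch below uses a normal approximation to the binomial and assumes the 1,432 questions are independent, each with five equally likely options; it is a back-of-the-envelope estimate, not part of the evaluation harness.

```python
import math

n_questions = 1432
chance = 0.20     # five-option multiple choice
accuracy = 0.2027

# Standard error of the accuracy if the model were guessing uniformly.
standard_error = math.sqrt(chance * (1 - chance) / n_questions)
z_score = (accuracy - chance) / standard_error

print(f"standard error under guessing: {standard_error:.4f}")
print(f"z-score of observed accuracy: {z_score:.2f}")
```

The observed score sits well within one standard error of the 20% baseline, so it cannot be distinguished from random guessing on this test set.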
Generated with temperature=0.8, top_k=50, repetition_penalty=1.1, seed 123. All output is lowercase — this is a tokenizer property, not a generation artifact.
Prompt: Machado de Assis nasceu no Rio de Janeiro
machado de assis nasceu no rio de janeiro. estudou na faculdade de direito da universidade federal do rio de janeiro (ufrj). participou das comissões técnicas com a experiência de seu trabalho e da comissão de ética de seus atos, em 1995.
Prompt: Em uma manha de domingo, joao caminhava
em uma manha de domingo, joao caminhava pelo centro da cidade, até um carro da polícia federal na região. quando o policial chegou não sabe o que aconteceu e acabou pegando a arma para ser removida.
What these samples show: fluent Portuguese grammar and topical coherence, with systematic factual hallucination. Machado de Assis did not study at UFRJ. The model produces plausible-sounding text without retrieving accurate facts — this is expected at 80M parameters.
These are not disclaimers — they are accurate descriptions of what this model can and cannot do at 87.8M parameters.
The tokenizer applies nmt_nfkc_cf normalization (lowercase). The model never generates uppercase characters.
huggingface.co/maracatu-ai/maracatu-80m
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("maracatu-ai/maracatu-80m", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("maracatu-ai/maracatu-80m")
model.eval()
inputs = tokenizer("O Brasil é", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=60, temperature=0.8, top_k=50, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
github.com/maracatu-ai/maracatu
Full source: model architecture, tokenizer training, corpus cleanup, export scripts, experiment logs.
kaggle.com/models/whereisanzi/maracatu-80m
Mirror of the training checkpoint with example notebook.
Code and weights are released under the Apache License 2.0.
Training data licenses are preserved per source (CC BY-SA 3.0 for Wikipedia PT, Public Domain for Project Gutenberg, ODC-BY 1.0 for CulturaX-PT).
@misc{maracatu80m2026,
author = {Anzileiro, Anderson},
title = {Maracatu-80M: An Open-Weight Brazilian Portuguese Language Model},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/maracatu-ai/maracatu-80m}},
}
Full documentation, architecture decisions, experiment logs, and roadmap: github.com/maracatu-ai/maracatu
Maracatu-80M is the second step. The roadmap runs to Maracatu-80B, targeting Llama-3.1-70B performance on Portuguese benchmarks (ENEM, OAB, BLUEX, POSCOMP).