
Open-weight Brazilian Portuguese LLM trained from scratch on 1.6B tokens. 87.8M params, Llama-style with GQA. Validation perplexity 21.34. Apache 2.0. Base model.

ollama run whereisanzi/maracatu-80m


🥁 Maracatu-80M

A Brazilian Portuguese causal language model, trained from scratch. Open weights, Apache 2.0.

Maracatu-80M is an 87.8M-parameter decoder-only transformer trained from scratch on Brazilian Portuguese text. It is the second public checkpoint of the Maracatu AI project, an open effort to build Portuguese-language LLMs with full transparency over architecture, data, and training.

This is a base model (completion only). It continues text — it is not a chat assistant and does not follow instructions.


Quick start

# pull the default quantization (Q4_K_M, ~54 MB)
ollama pull whereisanzi/maracatu-80m

# run with a prompt
ollama run whereisanzi/maracatu-80m "O Brasil é"

# inspect model metadata
ollama show whereisanzi/maracatu-80m

The model outputs lowercase text only — this is expected; the tokenizer normalizes all input to lowercase.
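
For scripted use, the same completion can be requested over Ollama's local HTTP API instead of the CLI. The sketch below assumes an Ollama server running on its default address (`localhost:11434`) with the model already pulled; the helper names `build_payload` and `complete` are illustrative, and the sampling options simply mirror the settings listed under Sample outputs.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="whereisanzi/maracatu-80m"):
    """Build a non-streaming request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {
            "temperature": 0.8,
            "top_k": 50,
            "repeat_penalty": 1.1,
            "num_ctx": 1024,  # the model's context length
        },
    }


def complete(prompt):
    """Send the prompt to the local server and return the generated continuation."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, `complete("O Brasil é")` should return the lowercase continuation as a string. Because this is a base model, the prompt is treated as the start of a text, not as a question.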

Specific quantizations

ollama pull whereisanzi/maracatu-80m:q5_k_m
ollama pull whereisanzi/maracatu-80m:q8_0

ollama run whereisanzi/maracatu-80m:q8_0 "A literatura brasileira é"

Available quantizations

| Tag | Method | File size | Recommended for |
|---|---|---|---|
| latest / q4_k_m | Q4_K_M | ~54 MB | General use; best size/quality tradeoff |
| q5_k_m | Q5_K_M | ~61 MB | Slightly higher fidelity, still fast |
| q8_0 | Q8_0 | ~90 MB | Evaluation, debugging, max precision |
| fp16 | f16 | ~168 MB | Full precision reference |

All variants run comfortably on CPU. At 87.8M parameters the difference between Q4_K_M and Q8_0 is perceptible but minor for most prompts.


About this model

Architecture

Llama-style decoder-only transformer (RMSNorm, RoPE, SwiGLU, GQA, no bias in linear layers, weight tying). Compatible bit-for-bit with transformers.LlamaForCausalLM (max_abs_diff=0.0 validated against native forward pass).

| Hyperparameter | Value |
|---|---|
| Total parameters | 87.80M |
| Non-embedding parameters | 75.52M |
| Layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| KV heads | 4 (GQA, 3:1 ratio) |
| Intermediate size (SwiGLU) | 2048 |
| Context length | 1024 tokens |
| Vocabulary | 16,000 (SentencePiece BPE, lowercase, split_digits) |
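
The parameter totals can be reproduced from the hyperparameters above. The following is a minimal sketch, assuming the usual Llama layout (four bias-free attention projections, three SwiGLU projections, and two RMSNorm weight vectors per layer, plus one final RMSNorm) and a single embedding matrix shared with the output head through weight tying. The function name `count_params` is illustrative.

```python
def count_params(vocab=16_000, hidden=768, layers=12, heads=12,
                 kv_heads=4, intermediate=2048):
    """Return (non_embedding, embedding) parameter counts for a Llama-style model."""
    head_dim = hidden // heads            # 64
    kv_dim = kv_heads * head_dim          # 256: K and V are narrower under GQA
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim   # Q, O + K, V
    mlp = 3 * hidden * intermediate       # gate, up, and down projections
    norms = 2 * hidden                    # two RMSNorm weight vectors per layer
    non_embedding = layers * (attention + mlp + norms) + hidden  # + final norm
    embedding = vocab * hidden            # counted once because weights are tied
    return non_embedding, embedding


non_embedding, embedding = count_params()
print(f"non-embedding: {non_embedding / 1e6:.2f}M")            # 75.52M
print(f"total:         {(non_embedding + embedding) / 1e6:.2f}M")  # 87.80M
```

Under these assumptions the counts match the table: 75,516,672 non-embedding parameters and 12,288,000 embedding parameters, for 87,804,672 in total.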

Training data

Corpus v2 — 1.60B tokens. SHA-256: a1000e873bfcae0d2229ecc9b329f0befe8ad73913e79e58f14a1f3a48ef7e58

| Source | License | Tokens |
|---|---|---|
| Wikipedia PT (wikimedia/wikipedia, 20231101.pt) | CC BY-SA 3.0 | ~550M |
| Project Gutenberg PT (24 curated public-domain works) | Public Domain | ~150M |
| CulturaX-PT filtered (1.49M docs) | ODC-BY 1.0 | ~900M |

Filtering: MinHash LSH dedup (Jaccard 0.85), PII regex (CPF, email, CEP, phone BR), language heuristic, byte-level dedup. No raw Common Crawl. No CC BY-NC sources.
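
To illustrate the near-duplicate step, here is a small MinHash sketch in pure Python. It is not the project's pipeline: the shingle size, hash function, number of permutations, and example sentences are all assumptions chosen for illustration, and the LSH banding that makes the comparison scale to millions of documents is omitted. Only the 0.85 Jaccard threshold comes from the description above.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams; a text shorter than n words yields one shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(seed, item):
    """Seeded 64-bit hash, standing in for one random permutation."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(items, num_perm=128):
    """Signature: the minimum hash value under each seeded hash function."""
    return [min(_hash(seed, s) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


THRESHOLD = 0.85  # Jaccard threshold stated in the filtering description

doc_a = "o brasil é um país localizado na américa do sul"
doc_b = "o brasil é um país localizado na américa do sul"
doc_c = "a literatura brasileira tem muitos autores conhecidos no mundo inteiro"

sig_a, sig_b, sig_c = (minhash(shingles(d)) for d in (doc_a, doc_b, doc_c))
is_dup_ab = estimated_jaccard(sig_a, sig_b) >= THRESHOLD
is_dup_ac = estimated_jaccard(sig_a, sig_c) >= THRESHOLD
print(is_dup_ab, is_dup_ac)  # True False
```

Identical documents produce identical signatures, so the pair is flagged, while documents with no shared shingles fall far below the threshold.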

Training

| Item | Value |
|---|---|
| Framework | PyTorch |
| Hardware | NVIDIA RTX 3060 12GB (single GPU, self-hosted) |
| Total iterations | 200,000 |
| Tokens seen | ~1.64B (~21.7 tokens per non-embedding parameter; the Chinchilla-optimal ratio is roughly 20) |
| Batch size | 8 |
| Precision | bf16 autocast (fp32 weights + optimizer) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1) |
| Learning rate | 2.5e-4 → 2.5e-5 (4k warmup + cosine decay) |
| Throughput | ~20,200 tok/s (stable throughout) |
| Total training time | 22h 31min continuous |
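
The training figures are mutually consistent, which the short check below makes explicit. It assumes that each iteration processes 8 full-length sequences of 1024 tokens with no gradient accumulation, that the warmup is linear and measured in iterations, and that the cosine decay ends at the final iteration. None of these details are stated here, so treat `lr_at` as an illustrative reconstruction of the schedule, not the project's code.

```python
import math

ITERATIONS = 200_000
BATCH_SIZE = 8                   # sequences per step (assumed: no accumulation)
CONTEXT = 1024                   # tokens per sequence (assumed: always full)
NON_EMBEDDING_PARAMS = 75.52e6
THROUGHPUT = 20_200              # tokens per second

tokens_seen = ITERATIONS * BATCH_SIZE * CONTEXT
tokens_per_param = tokens_seen / NON_EMBEDDING_PARAMS
hours = tokens_seen / THROUGHPUT / 3600


def lr_at(step, max_lr=2.5e-4, min_lr=2.5e-5, warmup=4_000, total=ITERATIONS):
    """Linear warmup to max_lr, then cosine decay to min_lr at the last step."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


print(f"tokens seen:      {tokens_seen / 1e9:.2f}B")   # 1.64B
print(f"tokens per param: {tokens_per_param:.1f}")     # 21.7
print(f"training time:    {hours:.1f} h")              # 22.5 h
print(f"lr at step 4,000: {lr_at(4_000):.1e}")         # 2.5e-04
```

The computed 22.5 hours agrees with the reported 22h 31min, and the 21.7 tokens-per-parameter figure is reproduced only when dividing by the non-embedding count.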

Evaluation

Validation on a 3.27M-token PT-BR holdout (last chronological segment, not seen during training):

| Metric | Value | Step |
|---|---|---|
| Best validation loss | 3.0163 | ~190,000 |
| Final validation loss | 3.0604 | 200,000 |
| Final validation perplexity | 21.34 | 200,000 |
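
Perplexity follows directly from the loss. The snippet below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for PyTorch training loops, so that perplexity is simply its exponential.

```python
import math


def perplexity(loss_nats):
    """Perplexity is the exponential of the mean per-token cross-entropy (nats)."""
    return math.exp(loss_nats)


final_ppl = perplexity(3.0604)
best_ppl = perplexity(3.0163)
print(f"final: {final_ppl:.2f}")  # 21.34
print(f"best:  {best_ppl:.2f}")   # 20.42
```

Under that assumption, the best checkpoint near step 190,000 corresponds to a perplexity of about 20.42, slightly better than the final 21.34.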

Zero-shot downstream benchmarks via lm-evaluation-harness 0.4.11:

| Task | Score | Random baseline |
|---|---|---|
| ENEM Challenge (1432 questions) | 20.27% | 20% (5-MCQ) |
| ASSIN Entailment | 29.08% | ~33% (3-class) |
| ASSIN Paraphrase | 52.42% | 50% (binary) |

Honest reading: ENEM is at chance, ASSIN Entailment is slightly below chance, and ASSIN Paraphrase is modestly above. Pretraining improvements show up in generation fluency (lower perplexity), not in MCQ accuracy. For reference, Tucano-160M reports validation perplexity around 22 on Portuguese text; Maracatu-80M reaches 21.34 with roughly half the parameters, though the comparison is only indicative (different harness versions, vocabulary, and holdout splits).
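
The "at chance" reading for ENEM can be checked with a one-sample z-score against the random baseline. This is a rough sketch using the normal approximation to the binomial. It covers ENEM only, because the ASSIN test-set sizes are not given here.

```python
import math


def z_vs_chance(accuracy, n, chance):
    """z-score of an observed accuracy against a chance rate over n questions."""
    standard_error = math.sqrt(chance * (1 - chance) / n)
    return (accuracy - chance) / standard_error


z_enem = z_vs_chance(accuracy=0.2027, n=1432, chance=0.20)
print(f"ENEM z-score vs. chance: {z_enem:.2f}")  # about 0.26
```

A z-score of about 0.26 is far inside the usual ±1.96 band, so the ENEM result is statistically indistinguishable from guessing.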


Sample outputs

Generated with temperature=0.8, top_k=50, repetition_penalty=1.1, seed 123. All output is lowercase — this is a tokenizer property, not a generation artifact.

Prompt: Machado de Assis nasceu no Rio de Janeiro

machado de assis nasceu no rio de janeiro. estudou na faculdade de direito da universidade federal do rio de janeiro (ufrj). participou das comissões técnicas com a experiência de seu trabalho e da comissão de ética de seus atos, em 1995.

Prompt: Em uma manha de domingo, joao caminhava

em uma manha de domingo, joao caminhava pelo centro da cidade, até um carro da polícia federal na região. quando o policial chegou não sabe o que aconteceu e acabou pegando a arma para ser removida.

What these samples show: fluent Portuguese grammar and topical coherence, with systematic factual hallucination. Machado de Assis did not study at UFRJ. The model produces plausible-sounding text without retrieving accurate facts — this is expected at 80M parameters.


Limitations

These are not disclaimers — they are accurate descriptions of what this model can and cannot do at 87.8M parameters.

  • Scale: 87.8M is small by 2026 standards. Factual recall is unreliable. Hallucination is the norm, not the exception.
  • Lowercase only: The tokenizer applies nmt_nfkc_cf normalization (lowercase). The model never generates uppercase characters.
  • Digit splitting: Numbers are tokenized digit-by-digit. Dates, arithmetic, and numeric reasoning are not reliable.
  • Mixed register: Trained primarily on Wikipedia, public-domain literature, and filtered web text. Output tends toward formal/encyclopedic prose; informal and conversational registers are underrepresented.
  • Context window: 1,024 tokens. Longer inputs are truncated.
  • No safety fine-tuning: Unfiltered base model. Not evaluated for harmful outputs; may generate biased, incorrect, or offensive content.
  • No instruction following: Prompting it like a chat assistant will not work as expected. It continues the prompt as text completion.
  • MCQ benchmarks at or near random chance: ENEM and ASSIN Entailment indistinguishable from guessing; ASSIN Paraphrase modestly above. Downstream task performance is not the strength of this release.
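
The lowercase and digit-splitting limitations both come from text preprocessing, and their effect can be approximated in a few lines. This sketch does not call SentencePiece. It stands in for `nmt_nfkc_cf` with NFKC normalization plus Unicode case folding (leaving out the rule's extra whitespace and control-character handling), and it imitates `split_digits` with a regular expression. Both helper names are illustrative.

```python
import re
import unicodedata


def approx_normalize(text):
    """Rough stand-in for nmt_nfkc_cf: NFKC normalization plus case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


def split_digits(text):
    """Break text into pieces so that every digit stands alone."""
    return re.findall(r"\d|\D+", text)


print(approx_normalize("Machado de Assis nasceu no Rio de Janeiro"))
# machado de assis nasceu no rio de janeiro
print(split_digits(approx_normalize("Em 1995")))
# ['em ', '1', '9', '9', '5']
```

Because case is discarded before tokenization, the model has no way to represent uppercase letters, and a year such as 1995 reaches it as four separate digit pieces.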

Use elsewhere

HuggingFace Hub — safetensors + GGUF files

huggingface.co/maracatu-ai/maracatu-80m

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# use_fast=False selects the slow (Python) tokenizer implementation
tokenizer = AutoTokenizer.from_pretrained("maracatu-ai/maracatu-80m", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("maracatu-ai/maracatu-80m")
model.eval()  # inference mode: disables dropout

# Base model: the prompt is continued as plain text, not answered
inputs = tokenizer("O Brasil é", return_tensors="pt")
with torch.no_grad():  # gradients are not needed for generation
    out = model.generate(**inputs, max_new_tokens=60, temperature=0.8, top_k=50, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))

GitHub — code, architecture, training scripts, docs

github.com/maracatu-ai/maracatu

Full source: model architecture, tokenizer training, corpus cleanup, export scripts, experiment logs.

Kaggle Models — checkpoint + metadata

kaggle.com/models/whereisanzi/maracatu-80m

Mirror of the training checkpoint with example notebook.


License & citation

Code and weights are released under the Apache License 2.0.

Training data licenses are preserved per source (CC BY-SA 3.0 for Wikipedia PT, Public Domain for Project Gutenberg, ODC-BY 1.0 for CulturaX-PT).

@misc{maracatu80m2026,
  author       = {Anzileiro, Anderson},
  title        = {Maracatu-80M: An Open-Weight Brazilian Portuguese Language Model},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/maracatu-ai/maracatu-80m}},
}

More

Full documentation, architecture decisions, experiment logs, and roadmap: github.com/maracatu-ai/maracatu

Maracatu-80M is the second step. The roadmap runs to Maracatu-80B, targeting Llama-3.1-70B performance on Portuguese benchmarks (ENEM, OAB, BLUEX, POSCOMP).