Open-weight Brazilian Portuguese LLM trained from scratch on 1.6B tokens. 87.8M params, Llama-style with GQA. Validation perplexity 21.34. Apache 2.0. Base model.

🥁 Maracatu-80M

A Brazilian Portuguese causal language model, trained from scratch. Open weights, Apache 2.0.

Maracatu-80M is an 87.8M-parameter decoder-only transformer trained from scratch on Brazilian Portuguese text. It is the second public checkpoint of the Maracatu AI project, an open effort to build Portuguese-language LLMs with full transparency over architecture, data, and training.

This is a base model (completion only). It continues text — it is not a chat assistant and does not follow instructions.

Quick start

# pull the default quantization (Q4_K_M, ~54 MB)
ollama pull whereisanzi/maracatu-80m

# run with a prompt
ollama run whereisanzi/maracatu-80m "O Brasil é"

# inspect model metadata
ollama show whereisanzi/maracatu-80m

The model outputs lowercase text only — this is expected; the tokenizer normalizes all input to lowercase.
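For scripted use, the same completion can be requested from a local Ollama server over its REST API. The sketch below assumes Ollama is listening on its default port (11434) and reuses the sampling settings reported for the sample outputs on this page. The helper names and the `num_predict` cap of 60 tokens are illustrative choices, not part of the release.

```python
import json
import urllib.request

# Ollama's default local endpoint (assumes a locally running server).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, model: str = "whereisanzi/maracatu-80m") -> dict:
    """Build a non-streaming /api/generate request body."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a stream of chunks
        "options": {
            "temperature": 0.8,
            "top_k": 50,
            "repeat_penalty": 1.1,
            "num_ctx": 1024,    # the model's context length
            "num_predict": 60,  # assumed cap on generated tokens
        },
    }


def generate(prompt: str) -> str:
    """Send the prompt to the local server and return the completion text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running and the model pulled, `generate("O Brasil é")` returns the continuation as a string. Because this is a base model, the prompt is continued rather than answered.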

Specific quantizations

ollama pull whereisanzi/maracatu-80m:q5_k_m
ollama pull whereisanzi/maracatu-80m:q8_0

ollama run whereisanzi/maracatu-80m:q8_0 "A literatura brasileira é"

Available quantizations

Tag	Method	File size	Recommended for
`latest` / `q4_k_m`	Q4_K_M	~54 MB	General use; best size/quality tradeoff
`q5_k_m`	Q5_K_M	~61 MB	Slightly higher fidelity, still fast
`q8_0`	Q8_0	~90 MB	Evaluation, debugging, max precision
`fp16`	f16	~168 MB	Full precision reference

All variants run comfortably on CPU. At 87.8M parameters the difference between Q4_K_M and Q8_0 is perceptible but minor for most prompts.
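As a rough sanity check on the table, the file sizes can be converted into an implied number of bits per weight. This sketch assumes the "~MB" figures are mebibytes and that each file is dominated by the 87.8M weights. GGUF files also carry metadata and tokenizer data, so the results are upper bounds rather than exact quantization widths.

```python
PARAMS = 87.8e6   # total parameters, from the model card
MIB = 1024 ** 2   # assumption: the "~MB" sizes are mebibytes

# Approximate file sizes from the quantization table.
FILE_SIZES_MIB = {"q4_k_m": 54, "q5_k_m": 61, "q8_0": 90, "fp16": 168}


def implied_bits_per_weight(size_mib: float, params: float = PARAMS) -> float:
    """File size in bits divided by the parameter count."""
    return size_mib * MIB * 8 / params


for tag, size in FILE_SIZES_MIB.items():
    print(f"{tag:7s} {size:4d} MiB  ~{implied_bits_per_weight(size):.1f} bits/weight")
```

Under these assumptions the fp16 file comes out near 16 bits per weight and Q8_0 near its nominal 8.5, which suggests the sizes are consistent with the stated parameter count. The 4- and 5-bit variants land somewhat above their nominal widths, as expected when a file includes non-weight data and some tensors are stored at higher precision.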

About this model

Architecture

Llama-style decoder-only transformer (RMSNorm, RoPE, SwiGLU, GQA, no bias in linear layers, weight tying). The weights load into transformers.LlamaForCausalLM and reproduce the native forward pass exactly (max_abs_diff=0.0).

Hyperparameter	Value
Total parameters	87.80M
Non-embedding parameters	75.52M
Layers	12
Hidden size	768
Attention heads	12
KV heads	4 (GQA, 3:1 ratio)
Intermediate size (SwiGLU)	2048
Context length	1024 tokens
Vocabulary	16,000 (SentencePiece BPE, lowercase, split_digits)
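The parameter totals in the table can be re-derived from the other hyperparameters. This is a minimal sketch, assuming the standard Llama layout: no biases, two RMSNorm weight vectors per layer plus one final norm, and a single embedding matrix shared with the output head because of weight tying. The function name is illustrative.

```python
def llama_param_count(
    vocab: int = 16_000,
    hidden: int = 768,
    layers: int = 12,
    heads: int = 12,
    kv_heads: int = 4,
    intermediate: int = 2048,
) -> tuple[int, int]:
    """Return (non_embedding, embedding) parameter counts for a tied-weight Llama."""
    head_dim = hidden // heads        # 768 / 12 = 64
    kv_dim = kv_heads * head_dim      # 4 * 64 = 256 (GQA shrinks K and V)

    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q and O, plus K and V
    mlp = 3 * hidden * intermediate                        # SwiGLU: gate, up, down
    norms = 2 * hidden                                     # two RMSNorms per layer

    non_embedding = layers * (attention + mlp + norms) + hidden  # + final RMSNorm
    embedding = vocab * hidden                                   # tied: counted once
    return non_embedding, embedding


non_emb, emb = llama_param_count()
print(f"non-embedding: {non_emb / 1e6:.2f}M")
print(f"total:         {(non_emb + emb) / 1e6:.2f}M")
```

With the table's values this gives 75,516,672 non-embedding parameters and 87,804,672 in total, matching the reported 75.52M and 87.80M.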

Training data

Corpus v2 — 1.60B tokens. SHA-256: a1000e873bfcae0d2229ecc9b329f0befe8ad73913e79e58f14a1f3a48ef7e58

Source	License	Tokens
Wikipedia PT (`wikimedia/wikipedia`, `20231101.pt`)	CC BY-SA 3.0	~550M
Project Gutenberg PT (24 curated public-domain works)	Public Domain	~150M
CulturaX-PT filtered (1.49M docs)	ODC-BY 1.0	~900M

Filtering: MinHash LSH near-duplicate removal (Jaccard threshold 0.85), PII regexes (CPF, email, CEP, Brazilian phone numbers), language heuristic, byte-level exact dedup. No raw Common Crawl. No CC BY-NC sources.
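To make two of these filtering steps concrete, here is a small illustration, not the project's actual pipeline. It computes exact Jaccard similarity over word 3-grams, which is the quantity MinHash LSH approximates at scale, and applies the 0.85 threshold. It also shows hypothetical regexes for CPF, CEP, and email; phone numbers are omitted, and the patterns and placeholder tags are assumptions.

```python
import re


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Exact Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    return jaccard(a, b) >= threshold


# Hypothetical patterns for formatted Brazilian identifiers and emails.
PII_PATTERNS = {
    "cpf": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
    "cep": re.compile(r"\b\d{5}-\d{3}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text: str) -> str:
    """Replace each PII match with a placeholder tag such as <CPF>."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{name.upper()}>", text)
    return text
```

Exact pairwise Jaccard is quadratic in the number of documents, which is why a corpus of this size would use MinHash signatures and LSH bucketing instead of comparing every pair.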

Training

Item	Value
Framework	PyTorch
Hardware	NVIDIA RTX 3060 12GB (single GPU, self-hosted)
Total iterations	200,000
Tokens seen	~1.64B (~21.7 tokens per non-embedding parameter, close to the Chinchilla heuristic of ~20)
Batch size	8
Precision	bf16 autocast (fp32 weights + optimizer)
Optimizer	AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1)
Learning rate	2.5e-4 → 2.5e-5 (4k warmup + cosine decay)
Throughput	~20,200 tok/s (stable throughout)
Total training time	22h 31min continuous
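The figures in this table are mutually consistent, which a few lines of arithmetic can confirm. The sketch assumes a batch of 8 sequences of 1,024 tokens with no gradient accumulation, a linear warmup, and a cosine decay that reaches the minimum learning rate at the final step. The warmup shape and the function name are assumptions, since the table gives only the endpoints.

```python
import math

ITERATIONS = 200_000
BATCH_SIZE = 8            # assumed: sequences per step, no gradient accumulation
CONTEXT = 1024            # tokens per sequence
WARMUP = 4_000
LR_MAX, LR_MIN = 2.5e-4, 2.5e-5
NON_EMBEDDING_PARAMS = 75.52e6
THROUGHPUT = 20_200       # tokens per second

TOKENS_SEEN = ITERATIONS * BATCH_SIZE * CONTEXT
HOURS = TOKENS_SEEN / THROUGHPUT / 3600


def lr_at(step: int) -> float:
    """Linear warmup to LR_MAX, then cosine decay to LR_MIN at the last step."""
    if step < WARMUP:
        return LR_MAX * (step + 1) / WARMUP
    progress = (step - WARMUP) / (ITERATIONS - WARMUP)
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1 + math.cos(math.pi * progress))


print(f"tokens seen:      {TOKENS_SEEN / 1e9:.2f}B")
print(f"tokens per param: {TOKENS_SEEN / NON_EMBEDDING_PARAMS:.1f}")
print(f"estimated hours:  {HOURS:.1f}")
```

Under these assumptions the run processes about 1.64B tokens, roughly 21.7 per non-embedding parameter, and the stated throughput implies about 22.5 hours, in line with the reported 22h 31min.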

Evaluation

Validation on a 3.27M-token PT-BR holdout (last chronological segment, not seen during training):

Metric	Value	Step
Best validation loss	3.0163	~190,000
Final validation loss	3.0604	200,000
Final validation perplexity	21.34	200,000
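Perplexity and loss in this table are two views of the same number: perplexity is the exponential of the mean per-token cross-entropy. A minimal check, assuming the loss is measured in nats (natural logarithm), which the reported pair of values is consistent with:

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity from mean per-token cross-entropy in nats."""
    return math.exp(loss_nats)


print(round(perplexity(3.0604), 2))  # final checkpoint: 21.34
print(round(perplexity(3.0163), 2))  # best checkpoint:  20.42
```

By the same formula, the best checkpoint near step 190,000 corresponds to a perplexity of about 20.4, slightly below the final value.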

Zero-shot downstream benchmarks via lm-evaluation-harness 0.4.11:

Task	Score	Random baseline
ENEM Challenge (1432 questions)	20.27%	20% (5-MCQ)
ASSIN Entailment	29.08%	~33% (3-class)
ASSIN Paraphrase	52.42%	50% (binary)

Honest reading: ENEM is at chance, ASSIN Entailment is slightly below chance, and ASSIN Paraphrase is modestly above it. Pretraining improvements show up in generation fluency (lower perplexity), not in MCQ accuracy. For reference, Tucano-160M reports validation perplexity around 22 on Portuguese text; Maracatu-80M reaches 21.34 with roughly half the parameters. The comparison is indicative only: harness versions, vocabularies, and holdout splits differ, and per-token perplexity is not directly comparable across tokenizers.
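The "at chance" reading for ENEM can be checked with a simple z-score against the random baseline. This sketch assumes the 1,432 questions are independent, each has five options, and a normal approximation to the binomial is adequate. It is not applied to ASSIN because the number of ASSIN examples is not given here.

```python
import math


def z_vs_chance(accuracy: float, n: int, chance: float) -> float:
    """Standard errors separating the observed accuracy from random guessing."""
    standard_error = math.sqrt(chance * (1 - chance) / n)
    return (accuracy - chance) / standard_error


z = z_vs_chance(accuracy=0.2027, n=1432, chance=0.20)
print(f"ENEM z-score vs. chance: {z:.2f}")
```

The result is about 0.26 standard errors above chance, far inside the usual 1.96 threshold, so the ENEM score is statistically indistinguishable from guessing under these assumptions.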

Sample outputs

Generated with temperature=0.8, top_k=50, repetition_penalty=1.1, seed 123. All output is lowercase — this is a tokenizer property, not a generation artifact.

Prompt: Machado de Assis nasceu no Rio de Janeiro

machado de assis nasceu no rio de janeiro. estudou na faculdade de direito da universidade federal do rio de janeiro (ufrj). participou das comissões técnicas com a experiência de seu trabalho e da comissão de ética de seus atos, em 1995.

Prompt: Em uma manha de domingo, joao caminhava

em uma manha de domingo, joao caminhava pelo centro da cidade, até um carro da polícia federal na região. quando o policial chegou não sabe o que aconteceu e acabou pegando a arma para ser removida.

What these samples show: fluent Portuguese grammar and topical coherence, with systematic factual hallucination. Machado de Assis did not study at UFRJ. The model produces plausible-sounding text without retrieving accurate facts — this is expected at 80M parameters.

Limitations

These are not disclaimers — they are accurate descriptions of what this model can and cannot do at 87.8M parameters.

Scale: 87.8M is small by 2026 standards. Factual recall is unreliable. Hallucination is the norm, not the exception.
Lowercase only: The tokenizer applies nmt_nfkc_cf normalization (lowercase). The model never generates uppercase characters.
Digit splitting: Numbers are tokenized digit-by-digit. Dates, arithmetic, and numeric reasoning are not reliable.
Mixed register: Trained primarily on Wikipedia, public-domain literature, and filtered web text. Output tends toward formal/encyclopedic prose; informal and conversational registers are underrepresented.
Context window: 1,024 tokens. Longer inputs are truncated.
No safety fine-tuning: Unfiltered base model. Not evaluated for harmful outputs; may generate biased, incorrect, or offensive content.
No instruction following: Prompting it like a chat assistant will not work as expected. It continues the prompt as text completion.
MCQ benchmarks at or near random chance: ENEM and ASSIN Entailment are indistinguishable from guessing; ASSIN Paraphrase is modestly above. Downstream task performance is not the strength of this release.
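The lowercase and digit-splitting limitations above both come from tokenizer preprocessing, and their effect is easy to approximate. The sketch below is an illustration only: it uses Python's NFKC normalization plus case folding as a stand-in for SentencePiece's `nmt_nfkc_cf` rule, and a regex to mimic `split_digits`. It does not call the released tokenizer, and the function names are hypothetical.

```python
import re
import unicodedata


def approx_normalize(text: str) -> str:
    """Approximate nmt_nfkc_cf: NFKC normalization followed by case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


def approx_split_digits(text: str) -> list[str]:
    """Each digit becomes its own piece; other non-space runs stay together."""
    return re.findall(r"\d|[^\d\s]+", text)


prompt = "Machado de Assis nasceu em 1839"
normalized = approx_normalize(prompt)
print(normalized)
print(approx_split_digits(normalized))
```

Since every prompt is lowercased before the model sees it, capitalization in the input carries no information, and a year such as 1839 reaches the model as four separate digit pieces.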

Use elsewhere

HuggingFace Hub — safetensors + GGUF files

huggingface.co/maracatu-ai/maracatu-80m

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# use_fast=False loads the slow (Python) tokenizer implementation
tokenizer = AutoTokenizer.from_pretrained("maracatu-ai/maracatu-80m", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("maracatu-ai/maracatu-80m")
model.eval()  # inference mode: disables dropout

# Base model: the prompt is continued as plain text, not answered
inputs = tokenizer("O Brasil é", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=60, temperature=0.8, top_k=50, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))

GitHub — code, architecture, training scripts, docs

github.com/maracatu-ai/maracatu

Full source: model architecture, tokenizer training, corpus cleanup, export scripts, experiment logs.

Kaggle Models — checkpoint + metadata

kaggle.com/models/whereisanzi/maracatu-80m

Mirror of the training checkpoint with example notebook.

License & citation

Code and weights are released under the Apache License 2.0.

Training data licenses are preserved per source (CC BY-SA 3.0 for Wikipedia PT, Public Domain for Project Gutenberg, ODC-BY 1.0 for CulturaX-PT).

@misc{maracatu80m2026,
  author       = {Anzileiro, Anderson},
  title        = {Maracatu-80M: An Open-Weight Brazilian Portuguese Language Model},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/maracatu-ai/maracatu-80m}},
}

Full documentation, architecture decisions, experiment logs, and roadmap: github.com/maracatu-ai/maracatu

Maracatu-80M is the second step. The roadmap runs to Maracatu-80B, targeting Llama-3.1-70B performance on Portuguese benchmarks (ENEM, OAB, BLUEX, POSCOMP).