
17M-param Brazilian Portuguese LLM, trained from scratch on Wikipedia PT. Base model, Apache 2.0.

ollama run whereisanzi/maracatu-20m

Details

1 month ago · 71414367cd10 · 14MB · llama · 16.8M · Q4_K_M
Maracatu maracatu-20m: base language model trained on Brazilian Portuguese. Completes text
Apache License 2.0 Copyright 2026 Anderson Anzileiro / Maracatu AI Licensed under the Apache License
{ "num_ctx": 512, "repeat_penalty": 1.1, "stop": [ "</s>", "<unk>" ]
{{ .Prompt }}

Readme

🥁 Maracatu-20M

A Brazilian Portuguese causal language model, trained from scratch. Open weights, Apache 2.0.

Maracatu-20M is a 17M-parameter (16.77M) decoder-only transformer trained from scratch on Portuguese-language Wikipedia. It is the first public checkpoint of the Maracatu AI project, an open effort to build Portuguese-language LLMs with full transparency over architecture, data, and training.

This is a base model (completion only). It continues text — it is not a chat assistant and does not follow instructions.


Quick start

# pull the default quantization (Q4_K_M, ~11 MB)
ollama pull whereisanzi/maracatu-20m

# run with a prompt
ollama run whereisanzi/maracatu-20m "O Brasil é"

# inspect model metadata
ollama show whereisanzi/maracatu-20m

The model outputs lowercase text only — this is expected; the tokenizer normalizes all input to lowercase.
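
The same completion can also be requested programmatically. The sketch below assumes a local Ollama server on its default port (11434) and uses the `/api/generate` endpoint with streaming disabled. The helper names `build_payload` and `complete` are illustrative and not part of any library, and the default sampling options simply mirror the settings quoted for the sample outputs.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local address (assumed)
MODEL = "whereisanzi/maracatu-20m"

def build_payload(prompt, temperature=0.8, top_k=50, seed=42, max_tokens=60):
    """Build the JSON body for a non-streaming completion request."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a stream of chunks
        "options": {
            "temperature": temperature,
            "top_k": top_k,
            "seed": seed,
            "num_predict": max_tokens,  # cap on the number of generated tokens
        },
    }

def complete(prompt, **kwargs):
    """Send the prompt to the local server and return the generated continuation."""
    body = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

# Requires a running Ollama server with the model already pulled:
# print(complete("O Brasil é"))
```

Because this is a base model, the prompt should be the beginning of a text to continue, not a question or instruction.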

Specific quantizations

ollama pull whereisanzi/maracatu-20m:q5_k_m
ollama pull whereisanzi/maracatu-20m:q8_0

ollama run whereisanzi/maracatu-20m:q8_0 "A literatura brasileira é"

Available quantizations

| Tag | Method | File size | Recommended for |
| --- | --- | --- | --- |
| latest / q4_k_m | Q4_K_M | ~11 MB | General use; best size/quality tradeoff |
| q5_k_m | Q5_K_M | ~13 MB | Slightly higher fidelity, still fast |
| q8_0 | Q8_0 | ~18 MB | Evaluation, debugging, max precision |

All three run comfortably on CPU. At this parameter count, quality differences between quantizations are perceptible but minor.
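
The Q8_0 file size can be sanity-checked from the block layout. This is a rough sketch, assuming every one of the reported 16.77M weights is stored in Q8_0 blocks (32 int8 values plus one 16-bit scale, so 34 bytes per 32 weights) and ignoring GGUF metadata and any tensors kept at higher precision. The function name is illustrative.

```python
# Q8_0 block layout: 32 int8 weights + one fp16 scale = 34 bytes per 32 weights
WEIGHTS_PER_BLOCK = 32
BYTES_PER_BLOCK = 32 * 1 + 2

def q8_0_size_mb(n_params):
    """Approximate Q8_0 payload size in MB (10^6 bytes), ignoring metadata."""
    n_blocks = n_params / WEIGHTS_PER_BLOCK
    return n_blocks * BYTES_PER_BLOCK / 1e6

total_params = 16_770_000  # reported total (16.77M), rounded
print(f"{q8_0_size_mb(total_params):.1f} MB")  # close to the ~18 MB listed for q8_0
```

The K-quant variants mix several block types, so their sizes do not reduce to a single ratio like this one.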


About this model

Architecture

Llama-style decoder-only transformer (RMSNorm, RoPE, SwiGLU, no bias in linear layers, weight tying).

| Hyperparameter | Value |
| --- | --- |
| Total parameters | 17M (16.77M) |
| Non-embedding parameters | 10.62M |
| Layers | 6 |
| Hidden size | 384 |
| Attention heads | 6 |
| Intermediate size (SwiGLU) | 1024 |
| Context length | 512 tokens |
| Vocabulary | 16,000 (SentencePiece BPE, lowercase, split_digits) |
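
The parameter totals can be reproduced from the hyperparameters. The sketch below assumes details the table does not state: standard multi-head attention with four square projections (Q, K, V, O) and no grouped-query sharing, two RMSNorm weight vectors per layer, one final RMSNorm, and a single embedding matrix shared with the output head (weight tying). The function name is illustrative.

```python
def count_params(vocab=16_000, hidden=384, layers=6, intermediate=1024):
    """Count parameters of a Llama-style decoder with tied embeddings."""
    embedding = vocab * hidden           # shared by input embedding and output head
    attention = 4 * hidden * hidden      # Q, K, V, O projections, no bias
    mlp = 3 * hidden * intermediate      # SwiGLU: gate, up, and down projections
    norms = 2 * hidden                   # two RMSNorm weight vectors per layer
    non_embedding = layers * (attention + mlp + norms) + hidden  # + final RMSNorm
    return embedding, non_embedding

embedding, non_embedding = count_params()
print(f"non-embedding: {non_embedding / 1e6:.2f}M")
print(f"total:         {(embedding + non_embedding) / 1e6:.2f}M")
```

Under these assumptions the counts come out to 10,621,824 non-embedding and 16,765,824 total parameters, which agrees with the 10.62M and 16.77M reported.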

Training data

| Property | Value |
| --- | --- |
| Source | Wikipedia PT (wikimedia/wikipedia, snapshot 20231101.pt) |
| License | CC BY-SA 4.0 |
| Articles | 979,492 (after filters + dedup) |
| Corpus size | 2.28 GB |
| Tokens | ~550M BPE tokens |

Training

| Item | Value |
| --- | --- |
| Framework | PyTorch |
| Hardware | Kaggle T4 (single GPU, 15.6 GB VRAM) |
| Total iterations | 50,000 |
| Tokens seen | ~410M (~0.75 epoch) |
| Batch size | 16 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1) |
| Learning rate | 3e-4 → 3e-5 (warmup + cosine decay) |
| Total training time | 5h 45min |
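
The token budget follows from these figures, assuming each iteration processes one batch of 16 full-length (512-token) sequences with no gradient accumulation. The schedule function is a generic linear-warmup plus cosine-decay sketch, not the project's training code; the warmup length `WARMUP_STEPS` is a hypothetical value, since the source does not state it.

```python
import math

ITERATIONS = 50_000
BATCH_SIZE = 16
CONTEXT = 512
CORPUS_TOKENS = 550e6                 # approximate corpus size in BPE tokens
TRAIN_SECONDS = 5 * 3600 + 45 * 60    # 5h 45min

tokens_seen = ITERATIONS * BATCH_SIZE * CONTEXT
print(f"tokens seen: {tokens_seen / 1e6:.1f}M")
print(f"epochs:      {tokens_seen / CORPUS_TOKENS:.2f}")
print(f"throughput:  {tokens_seen / TRAIN_SECONDS:,.0f} tokens/s")

MAX_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS = 1_000                  # hypothetical: not given in the source

def learning_rate(step):
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR at the final step."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (ITERATIONS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

This gives 409.6M tokens, about three quarters of one pass over the ~550M-token corpus, at just under 20,000 tokens per second on the T4.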

Evaluation

| Metric | Value | Step |
| --- | --- | --- |
| Best validation perplexity | 23.81 | 43,500 |
| Best validation loss | 3.1703 | 43,500 |
| Train/val gap | ~0.05 | |

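No measurable overfitting: the train/validation loss gap is about 0.05. For reference, Tucano-160M reports validation perplexity ~30 on Portuguese text, while Maracatu-20M reaches 23.81 with roughly 10× fewer parameters. The two figures are not directly comparable, because per-token perplexity depends on the tokenizer and on the validation set.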
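
The reported perplexity and loss are two views of the same number: perplexity is the exponential of the mean per-token cross-entropy loss, assuming the loss is measured in nats (natural logarithm). A quick check using only the reported loss:

```python
import math

val_loss = 3.1703                        # best validation loss, nats per token
perplexity = math.exp(val_loss)          # perplexity = exp(mean cross-entropy)
bits_per_token = val_loss / math.log(2)  # same loss expressed in bits

print(f"perplexity:     {perplexity:.2f}")
print(f"bits per token: {bits_per_token:.2f}")
```

The computed perplexity rounds to 23.81, matching the reported value.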


Sample outputs

Generated with temperature=0.8, top_k=50, seed=42. All output is lowercase — this is a tokenizer property, not a generation artifact.

Prompt: O Brasil é

o brasil é uma espécie de ave da família dos caririformes.

Prompt: A capital de Pernambuco é

a capital de pernambuco é um município brasileiro do estado do rio de janeiro.

What these samples show: the model produces syntactically plausible Portuguese with an encyclopedic style. They also illustrate the primary limitation at this scale: factual hallucination is common and expected. The capital of Pernambuco is Recife. The model does not know this reliably.
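
For readers unfamiliar with the sampling settings, the following is a minimal sketch of temperature plus top-k sampling over a list of logits. It is a generic pure-Python illustration, not the code used to produce the samples; `sample_top_k` and the toy logits are hypothetical.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=50, rng=None):
    """Sample one token id from the top_k highest logits after temperature scaling."""
    rng = rng or random.Random()
    # Indices of the top_k largest logits (all of them if top_k exceeds the vocabulary)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights, shifted by the maximum for numerical stability
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(42)                  # fixed seed makes the draw repeatable
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
token_id = sample_top_k(toy_logits, temperature=0.8, top_k=3, rng=rng)
print(token_id)                          # always one of the three best ids: 0, 1, or 2
```

Sampling of this kind picks plausible next tokens, not true ones, which is why fluent but factually wrong continuations are expected from a small base model.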


Limitations

These are not disclaimers — they are accurate descriptions of what this model can and cannot do at 17M parameters.

  • Scale: 17M parameters is small. Factual recall is unreliable. Hallucination is the norm, not the exception.
  • Lowercase only: The tokenizer applies nmt_nfkc_cf normalization (NFKC plus Unicode case folding, which lowercases the text). The model never generates uppercase characters.
  • Digit splitting: Numbers are tokenized digit-by-digit. Dates, arithmetic, and numeric reasoning are not reliable.
  • Encyclopedic register: Trained on Wikipedia only. Output tends toward formal, encyclopedic prose. Informal and conversational registers are underrepresented.
  • Portuguese Wikipedia bias: Topics with sparse coverage on Portuguese Wikipedia produce lower-quality output.
  • No safety fine-tuning: This is an unfiltered base model. It has not been evaluated for harmful outputs and may generate biased, incorrect, or offensive content.
  • No instruction following: Prompting it like a chat assistant will not work as expected. It continues the prompt as text completion.
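
The two tokenizer-related limitations can be approximated without loading the tokenizer. The sketch below mimics them with the standard library: NFKC normalization plus case folding (a stand-in for `nmt_nfkc_cf`, which also applies extra rules not reproduced here) and splitting every digit into its own piece (a stand-in for `split_digits`). The function names are illustrative.

```python
import re
import unicodedata

def normalize_like_tokenizer(text):
    """Approximate nmt_nfkc_cf: NFKC normalization followed by case folding."""
    return unicodedata.normalize("NFKC", text).casefold()

def split_digits(text):
    """Split text so that every digit becomes its own piece."""
    return re.findall(r"\d|\D+", text)

print(normalize_like_tokenizer("A capital de Pernambuco é Recife"))
print(split_digits("em 1822"))
```

Casing is discarded before the model sees any text, and a year such as 1822 reaches it as four separate digit pieces.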

Use elsewhere

HuggingFace Hub — safetensors + GGUF files

huggingface.co/maracatu-ai/maracatu-20m

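from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# use_fast=False loads the slow (Python) tokenizer implementation
tokenizer = AutoTokenizer.from_pretrained("maracatu-ai/maracatu-20m", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("maracatu-ai/maracatu-20m")
model.eval()  # inference mode (disables dropout)

torch.manual_seed(42)  # fix the RNG so the sampled continuation is repeatable

inputs = tokenizer("O Brasil é", return_tensors="pt")
with torch.no_grad():
    # sample with temperature and top-k, as in the sample outputs
    out = model.generate(**inputs, max_new_tokens=60, temperature=0.8, top_k=50, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))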

GitHub — code, architecture, training scripts, docs

github.com/maracatu-ai/maracatu

Full source: model architecture, tokenizer training, corpus cleanup, export scripts, experiment logs.

Kaggle Models — checkpoint + metadata

kaggle.com/models/whereisanzi/maracatu-20m

Original training checkpoint and Kaggle kernel used for the T4 run.


License & citation

Code and weights are released under the Apache License 2.0.

Training data (Wikipedia PT) is licensed CC BY-SA 4.0 by the Wikimedia Foundation and contributors.

@misc{maracatu2026,
  author       = {Anzileiro, Anderson},
  title        = {Maracatu-20M: A Brazilian Portuguese Language Model Trained from Scratch},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/maracatu-ai/maracatu-20m}},
}

More

Full documentation, architecture decisions, experiment logs, and roadmap: github.com/maracatu-ai/maracatu

Maracatu-20M is the first step. The roadmap runs to Maracatu-70B, targeting Llama-3.1-70B performance on Portuguese benchmarks (ENEM, OAB, BLUEX, POSCOMP).