SigmaAI — 80M (Test Run)

An 80M-parameter language model trained from scratch on a personal AMD GPU. This is my first test model; the full 221M version is currently in training.

About

SigmaAI-80M is the first test model I trained entirely from scratch — no pre-trained base, no fine-tuning on top of someone else’s weights. Every parameter was initialized randomly and learned from raw text data.

This model was primarily a proof of concept to validate the training pipeline before committing to the full 221M run. It showed the stack worked end-to-end: custom tokenizer, data pipeline, training loop, checkpointing, and GGUF export all functioning correctly on consumer AMD hardware.

The full SigmaAI 221M model is currently in training on the same hardware.

Model Details

Property	Value
Parameters	~80M
Architecture	Transformer (LLaMA-style)
Vocabulary size	32,000
Tokenizer	Custom BPE (trained on the same corpus)
Attention	Multi-Head Attention with RoPE
FFN	SwiGLU
Normalization	RMSNorm
Precision	bfloat16 during training
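To make the table concrete, here is a minimal sketch of two of the listed building blocks, RMSNorm and a SwiGLU feed-forward layer. It is written in plain Python for readability rather than PyTorch, and it is not the model's actual code: the function names, the epsilon value, and the weight layout (one row per output unit, no biases) are assumptions. Attention and RoPE are left out.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-channel weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x)), without biases."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```

Unlike LayerNorm, RMSNorm does not subtract the mean, and SwiGLU uses three weight matrices per feed-forward layer instead of two.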

Training

Property	Value
Hardware	ASRock AMD Radeon RX 9060 XT Challenger 16GB OC
Framework	PyTorch + ROCm 7.2
Optimizer	Fused AdamW
Learning rate	2e-4 with cosine warmup
Mixed precision	bfloat16 (AMP)
Compiled	Yes — torch.compile()
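The learning-rate entry can be illustrated with a short sketch. It reads "cosine warmup" as a linear warmup to the 2e-4 peak followed by cosine decay, which is an assumption about the schedule's shape. The warmup length, total step count, and minimum learning rate are hypothetical placeholders, not values from the actual run.

```python
import math

def learning_rate(step, peak_lr=2e-4, warmup_steps=1000,
                  total_steps=100_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    warmup_steps, total_steps and min_lr are illustrative placeholders.
    """
    if step < warmup_steps:
        # Ramp linearly so the first updates use a small step size.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

A function like this can be called once per optimizer step to set the learning rate before the update.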

The entire training stack was written from scratch, including:
- A custom BPE tokenizer trained on the corpus
- A binary token cache for fast data loading
- A background prefetch thread to keep the GPU saturated
- An auto-restart launcher that resumes from checkpoints on any crash
- Gradient checkpointing to fit larger batches in VRAM
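Two of the components listed above, the binary token cache and the background prefetch thread, can be sketched with the standard library alone. This is an illustration of the general idea, not the project's code: the function names, the unsigned 16-bit storage format (chosen because a 32,000-entry vocabulary fits in 16 bits), and the queue depth are all assumptions.

```python
import queue
import threading
from array import array

def write_token_cache(path, token_ids):
    """Store token IDs as raw unsigned 16-bit integers."""
    with open(path, "wb") as f:
        array("H", token_ids).tofile(f)

def read_token_cache(path):
    """Load the whole cache back into memory as an array of token IDs."""
    tokens = array("H")
    with open(path, "rb") as f:
        tokens.frombytes(f.read())
    return tokens

def prefetch_sequences(tokens, seq_len, max_queue=4):
    """Yield non-overlapping sequences prepared by a background thread."""
    q = queue.Queue(maxsize=max_queue)
    done = object()  # sentinel marking the end of the data

    def worker():
        for start in range(0, len(tokens) - seq_len + 1, seq_len):
            q.put(list(tokens[start:start + seq_len]))
        q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item
```

The bounded queue lets the worker run a few sequences ahead of the consumer without loading everything at once. A real loader would also batch, shuffle, and move data to the GPU.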

Training Data

Trained on a personal collection of text data including various JSON, JSONL, and plain text files — roughly 2.34 billion tokens total. The tokenizer was trained on a representative sample of the same corpus.

Usage

ollama run ermwhatesigma420/sigmaAI:80M

Or with the API:

curl http://localhost:11434/api/generate -d '{
  "model": "ermwhatesigma420/sigmaAI:80M",
  "prompt": "Once upon a time",
  "stream": false
}'

Recommended parameters

temperature     0.7     # good balance of creativity vs coherence
repeat_penalty  1.1     # reduces repetitive output
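For scripted use, the curl request can be reproduced in Python with the recommended parameters passed through the API's `options` field. This is a minimal sketch using only the standard library; the helper names `build_request` and `generate` are illustrative, and it assumes an Ollama server is running locally on the default port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as the curl example

def build_request(prompt, temperature=0.7, repeat_penalty=1.1):
    """Build a non-streaming generate request with sampling options."""
    payload = {
        "model": "ermwhatesigma420/sigmaAI:80M",
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "repeat_penalty": repeat_penalty},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def generate(prompt):
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Once upon a time")` returns the model's continuation as a string. Because this is a base model, prompts work best as the opening of a text to continue rather than as instructions.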

Limitations

This is a test model. It was trained to validate the pipeline, not to be a production-quality assistant.
At 80M parameters it is capable of coherent text generation but will struggle with complex reasoning, long-range context, and factual accuracy.
It does not have instruction tuning or RLHF — it is a base language model that continues text rather than following instructions.
Knowledge is limited to whatever was in the training corpus.
Like all language models it can generate plausible-sounding but incorrect information.

What Comes Next

This model was step one. The full SigmaAI 221M is currently training on the same hardware with:
- 12 layers, d_model=1024, 16 attention heads
- ~2.34 billion training tokens
- Flash Attention enabled via ROCm 7.2 + AOTriton
- The same fully custom training stack, now optimized for speed
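The listed dimensions can be turned into a rough parameter estimate. The sketch below is back-of-the-envelope arithmetic under several assumptions not stated in this card: the 221M model keeps the 32,000-token vocabulary, uses standard multi-head attention without biases, has untied input and output embeddings, and uses a feed-forward width `d_ff` of 2816, which is a purely hypothetical placeholder.

```python
def estimate_parameters(n_layers, d_model, vocab_size, d_ff, tied_embeddings=False):
    """Approximate parameter count for a LLaMA-style decoder without biases."""
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    ffn = 3 * d_model * d_ff            # SwiGLU: gate, up and down projections
    norms = 2 * d_model                 # two RMSNorm weight vectors per block
    embeddings = vocab_size * d_model
    output_head = 0 if tied_embeddings else vocab_size * d_model
    final_norm = d_model
    return n_layers * (attention + ffn + norms) + embeddings + output_head + final_norm

head_dim = 1024 // 16  # per-head width implied by d_model and head count
total = estimate_parameters(n_layers=12, d_model=1024, vocab_size=32_000, d_ff=2816)
print(f"head_dim={head_dim}, estimated parameters={total / 1e6:.1f}M")
```

The result depends on the placeholder `d_ff` and the embedding-tying choice, so it should not be expected to match 221M exactly.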

Why I Built This

I wanted to understand what it actually takes to train a language model end-to-end — not fine-tune an existing one, not run someone else’s weights, but build the entire thing from the ground up. Every component — the tokenizer, the model architecture, the training loop, the data pipeline — was written and debugged by hand on consumer hardware.

This project is proof that training a real transformer language model does not require a data center. It requires patience, a good GPU, and a lot of debugging.

License

This model was trained and released for personal and research use.
It is mostly a school project.

This is the first model I have built and trained myself.
