Schematron-8B is an instruction-tuned, Llama-architecture model published as inference-net/Schematron-8B.
This repo provides GGUF builds for llama.cpp and packaged tags for Ollama.
Architecture: llama (Llama 3.1 family).

Recommended default is Q4_K_M (best quality/size balance). Use IQ4_XS if you need a smaller download.
| Tag | Size | Approx RAM* | Description |
|---|---|---|---|
| IQ4_XS | ~4.5 GB | ~7–10 GB + KV cache | Smaller / faster |
| Q4_K_M | ~4.9 GB | ~7–10 GB + KV cache | Recommended |
*KV cache RAM depends heavily on your configured context window (num_ctx). See “System Requirements”.
```shell
# Recommended
ollama pull richardyoung/schematron-8b:Q4_K_M
ollama run richardyoung/schematron-8b:Q4_K_M "Summarize this text in 5 bullets: ..."

# Smaller
ollama pull richardyoung/schematron-8b:iq4_xs
ollama run richardyoung/schematron-8b:iq4_xs "Explain this error and propose a fix: ..."
```

```shell
ollama run richardyoung/schematron-8b:Q4_K_M "Read this and answer questions:\n\n[paste doc here]"
ollama run richardyoung/schematron-8b:Q4_K_M "Create a step-by-step plan to refactor this module:\n\n[paste code here]"
ollama run richardyoung/schematron-8b:Q4_K_M --format json "Return a JSON object with keys {title, summary, risks} for:\n\n[paste text here]"
```
This model is packaged with a Llama 3-style chat template (special tokens such as `<|start_header_id|>` / `<|eot_id|>`).
If you create your own Ollama Modelfile, use the templates in:
- `modelfiles/Schematron-8B-Q4_K_M.Modelfile`
- `modelfiles/Schematron-8B-IQ4_XS.Modelfile`

Suggested sampling (from GGUF metadata):

- `temperature`: 0.6
- `top_p`: 0.9

Q4_K_M / IQ4_XS need ~5 GB of storage for weights, and typically ~8 GB+ RAM once you include runtime overhead. KV cache memory grows roughly linearly with `num_ctx`; very large contexts can require tens of GB of additional RAM.
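As a sketch of how those sampling defaults fit into a custom Modelfile (the `FROM` path here is hypothetical — point it at your own local GGUF file, and prefer the packaged Modelfiles above, which also carry the chat template):

```
FROM ./Schematron-8B-Q4_K_M.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
```

Build a local tag from it with `ollama create my-schematron -f Modelfile`.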
If you don’t need extreme context lengths, keep `num_ctx` modest (e.g. 8K–32K) for a much lower RAM footprint.
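For a rough back-of-the-envelope check, assuming Llama-3.1-8B-class shapes (32 layers, 8 KV heads, head dim 128, fp16 cache — verify against the GGUF metadata for your build), the KV-cache cost per context length can be estimated as:

```shell
# Approximate fp16 KV-cache size for an assumed 32-layer / 8-KV-head /
# 128-head-dim model: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes.
NUM_CTX=8192
BYTES_PER_TOKEN=$((2 * 32 * 8 * 128 * 2))
echo "KV cache at num_ctx=$NUM_CTX: ~$((BYTES_PER_TOKEN * NUM_CTX / 1024 / 1024)) MiB"
```

Under these assumed shapes the cache costs ~128 KiB per token, so an 8K context adds about 1 GiB and a 128K context about 16 GiB.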
This model is governed by the upstream licensing and terms of use.
The GGUF quantizations are derivative artifacts; you must comply with all upstream terms before redistribution or commercial use.