Schematron-8B (GGUF for Ollama)

Overview

Schematron-8B is an instruction-tuned, Llama-architecture model published as inference-net/Schematron-8B. This repo provides GGUF builds for llama.cpp and packaged tags for Ollama.

Key Features

  • Long-context capable (131K context metadata)
  • Higher quality than 3B-class models at the cost of more compute
  • Uses the Llama 3 chat template (works well with standard Llama 3-style prompts)

Available Versions

The recommended default is Q4_K_M (best quality-to-size balance); use IQ4_XS if you need a smaller, slightly faster download.

Tag      Size     Approx. RAM*          Description
IQ4_XS   ~4.5 GB  ~7–10 GB + KV cache   Smaller / faster
Q4_K_M   ~4.9 GB  ~7–10 GB + KV cache   Recommended

*KV cache RAM depends heavily on your configured context window (num_ctx). See “System Requirements”.

Quick Start

# Recommended
ollama pull richardyoung/schematron-8b:Q4_K_M
ollama run  richardyoung/schematron-8b:Q4_K_M "Summarize this text in 5 bullets: ..."

# Smaller
ollama pull richardyoung/schematron-8b:iq4_xs
ollama run  richardyoung/schematron-8b:iq4_xs "Explain this error and propose a fix: ..."

Example Use Cases

Long document Q&A

ollama run richardyoung/schematron-8b:Q4_K_M "Read this and answer questions:\n\n[paste doc here]"

Planning and analysis

ollama run richardyoung/schematron-8b:Q4_K_M "Create a step-by-step plan to refactor this module:\n\n[paste code here]"

Structured outputs

ollama run richardyoung/schematron-8b:Q4_K_M --format json $'Return a JSON object with keys {title, summary, risks} for:\n\n[paste text here]'
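
From code, the same JSON-constrained request can go through Ollama's local REST API. The endpoint and the model/prompt/format/stream/options fields below are standard Ollama; the prompt text and sampling values are just illustrative:

curl http://localhost:11434/api/generate -d '{
  "model": "richardyoung/schematron-8b:Q4_K_M",
  "prompt": "Return a JSON object with keys {title, summary, risks} for: ...",
  "format": "json",
  "stream": false,
  "options": { "temperature": 0.6, "top_p": 0.9 }
}'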

Prompt Format / Templates

This model is packaged with a Llama 3-style chat template (special tokens like <|start_header_id|> / <|eot_id|>). If you create your own Ollama Modelfile, start from the templates in the files below (a minimal sketch follows the list):

  • modelfiles/Schematron-8B-Q4_K_M.Modelfile
  • modelfiles/Schematron-8B-IQ4_XS.Modelfile
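
As a rough starting point, a hand-written Modelfile might look like the sketch below. The local GGUF filename is an assumption, and the TEMPLATE is the standard abbreviated Llama 3 form; copy the complete template from the files above before building anything you rely on.

# Minimal Modelfile sketch (filename assumed; see the shipped Modelfiles for the full template)
FROM ./Schematron-8B-Q4_K_M.gguf

TEMPLATE """<|begin_of_text|>{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}"""

PARAMETER stop "<|eot_id|>"
PARAMETER temperature 0.6
PARAMETER top_p 0.9

Build it with: ollama create schematron-8b-local -f Modelfile (the tag name is arbitrary).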

Suggested sampling (from GGUF metadata):

  • temperature: 0.6
  • top_p: 0.9
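
You can also apply these per session without editing any Modelfile: inside an interactive ollama run session, /set parameter overrides sampling for that session only (the values below simply restate the suggested defaults):

ollama run richardyoung/schematron-8b:Q4_K_M
>>> /set parameter temperature 0.6
>>> /set parameter top_p 0.9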

System Requirements (Practical)

Weights (minimum)

  • Q4_K_M / IQ4_XS need ~5 GB of storage for the weights and typically ~8 GB+ of RAM once runtime overhead is included.

Context window (the real memory driver)

KV cache memory grows roughly linearly with num_ctx. Very large contexts can require tens of GB of additional RAM. If you don’t need extreme context lengths, keep num_ctx modest (e.g. 8K–32K) for a much lower RAM footprint.
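As a rough sanity check, assuming the usual Llama 3.1 8B geometry (32 layers, 8 KV heads, head dimension 128) and an fp16 KV cache, each token costs 2 × 32 × 8 × 128 × 2 bytes = 128 KiB of cache, so a 32K context adds about 4 GiB and the full 131K about 16 GiB on top of the weights; exact numbers vary with runtime and KV-cache quantization. To cap the cache, set num_ctx explicitly:

# Interactively, inside an ollama run session:
>>> /set parameter num_ctx 16384

# Or bake the limit into a custom Modelfile:
PARAMETER num_ctx 16384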

License

This model is governed by the licensing and terms of use of the upstream model (inference-net/Schematron-8B, which builds on Llama 3.1).

The GGUF quantizations are derivative artifacts; you must comply with all upstream terms before redistribution or commercial use.

Acknowledgments

  • inference-net for the fine-tune
  • Meta Llama for Llama 3.1
  • llama.cpp for GGUF conversion and quantization tools
  • Ollama for packaging and distribution