
# Schematron-3B (GGUF for Ollama)

## Overview

Schematron-3B is an instruction-tuned, Llama-architecture model published as inference-net/Schematron-3B. This repo provides GGUF builds for llama.cpp and packaged tags for Ollama.

## Key Features

  • Long-context capable (131,072-token context window, per GGUF metadata)
  • Small enough to run locally, with multiple quantizations
  • Uses the Llama 3 chat template (works well with standard Llama 3-style prompts)

## Available Versions

The recommended default is Q4_K_M (best quality/size balance). Use IQ4_XS if you need a smaller download.

| Tag | Size | Approx. RAM* | Description |
|---|---:|---:|---|
| IQ4_XS | ~1.8 GB | ~3–4 GB + KV cache | Smaller / faster |
| Q4_K_M | ~2.0 GB | ~3–4 GB + KV cache | Recommended |

*KV cache RAM depends heavily on your configured context window (num_ctx). See “System Requirements”.
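
Once a tag is pulled, you can confirm the quantization and context metadata locally:

```bash
# Prints architecture, parameter count, context length, and quantization
ollama show richardyoung/schematron-3b:Q4_K_M
```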

## Quick Start

```bash
# Recommended
ollama pull richardyoung/schematron-3b:Q4_K_M
ollama run richardyoung/schematron-3b:Q4_K_M "Summarize this text in 5 bullets: …"

# Smaller
ollama pull richardyoung/schematron-3b:iq4_xs
ollama run richardyoung/schematron-3b:iq4_xs "Explain this error and propose a fix: …"
```

## Example Use Cases

### Long document Q&A

```bash
ollama run richardyoung/schematron-3b:Q4_K_M "Read this and answer questions:\n\n[paste doc here]"
```

### Planning and analysis

```bash
ollama run richardyoung/schematron-3b:Q4_K_M "Create a step-by-step plan to refactor this module:\n\n[paste code here]"
```

### Structured outputs

```bash
ollama run richardyoung/schematron-3b:Q4_K_M --format json "Return a JSON object with keys {title, summary, risks} for:\n\n[paste text here]"
```
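
The --format json flag constrains generation to valid JSON; results are best when the prompt also spells out the desired keys, as above.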

## Prompt Format / Templates

This model is packaged with a Llama 3-style chat template (special tokens like <|start_header_id|> / <|eot_id|>). If you create your own Ollama Modelfile, use the templates in:

  • modelfiles/Schematron-3B-Q4_K_M.Modelfile
  • modelfiles/Schematron-3B-IQ4_XS.Modelfile
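
To build a local tag from one of those Modelfiles, a minimal sketch (the tag name schematron-3b-local is arbitrary):

```bash
# Create a local tag from the packaged Modelfile, then run it
ollama create schematron-3b-local -f modelfiles/Schematron-3B-Q4_K_M.Modelfile
ollama run schematron-3b-local "Hello"
```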

Suggested sampling (from GGUF metadata):

  • temperature: 0.6
  • top_p: 0.9
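
These can also be set per request through Ollama's HTTP API (a minimal sketch, assuming a default local install listening on port 11434):

```bash
# Set sampling options for a single request
curl http://localhost:11434/api/generate -d '{
  "model": "richardyoung/schematron-3b:Q4_K_M",
  "prompt": "Summarize this text in 5 bullets: …",
  "options": { "temperature": 0.6, "top_p": 0.9 }
}'
```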

## System Requirements (Practical)

### Weights (minimum)

  • Q4_K_M / IQ4_XS can run on a typical laptop/desktop (a few GB of RAM for weights + runtime overhead).

### Context window (the real memory driver)

KV cache memory grows roughly linearly with num_ctx. Even on a 3B model, very large contexts can require tens of GB of additional RAM. If you don’t need extreme context lengths, keep num_ctx modest (e.g. 8K–32K) for a much lower RAM footprint.
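
As a rough back-of-envelope, assuming an fp16 KV cache and Llama 3.2 3B's published shape (28 layers, 8 KV heads, head dim 128): 2 (K and V) × 28 × 8 × 128 × 2 bytes ≈ 112 KiB per token, so the full 131,072-token window costs roughly 14 GiB for the KV cache alone, while 16K tokens stays under 2 GiB. You can cap the window per request (a sketch via the HTTP API):

```bash
# Bound KV-cache memory by capping the context window for this request
curl http://localhost:11434/api/generate -d '{
  "model": "richardyoung/schematron-3b:Q4_K_M",
  "prompt": "Read this and answer questions:\n\n[paste doc here]",
  "options": { "num_ctx": 16384 }
}'
```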

## License

This model is governed by the upstream licensing and terms of use.

The GGUF quantizations are derivative artifacts; you must comply with all upstream terms before redistribution or commercial use.

## Acknowledgments

  • inference-net for the fine-tune
  • Meta Llama for Llama 3.2
  • llama.cpp for GGUF conversion and quantization tools
  • Ollama for packaging and distribution