## Overview
Schematron-3B is an instruction-tuned, Llama-architecture model published as inference-net/Schematron-3B.
This repo provides GGUF builds for llama.cpp and packaged tags for Ollama.
## Key Features
- Llama 3.2-family architecture (`llama`), instruction-tuned, 3B parameters
- GGUF quantizations for llama.cpp: `IQ4_XS` and `Q4_K_M`
- Packaged Ollama tags for one-command `ollama pull` / `ollama run`
## Available Versions
Recommended default is `Q4_K_M` (best quality/size balance). Use `IQ4_XS` if you need a smaller download.

| Tag | Size | Approx. RAM* | Description |
|---|---:|---:|---|
| `IQ4_XS` | ~1.8 GB | ~3–4 GB + KV cache | Smaller / faster |
| `Q4_K_M` | ~2.0 GB | ~3–4 GB + KV cache | Recommended |

\*KV cache RAM depends heavily on your configured context window (`num_ctx`). See “System Requirements”.
## Quick Start
```bash
# Recommended
ollama pull richardyoung/schematron-3b:Q4_K_M
ollama run richardyoung/schematron-3b:Q4_K_M "Summarize this text in 5 bullets: ..."

# Smaller
ollama pull richardyoung/schematron-3b:iq4_xs
ollama run richardyoung/schematron-3b:iq4_xs "Explain this error and propose a fix: ..."
```
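You can also call the model through the local Ollama REST API. A minimal sketch, assuming `ollama serve` is listening on the default port 11434:

```bash
# Minimal sketch: the same call through the local Ollama REST API.
# Assumes `ollama serve` is running on the default port 11434.
curl http://localhost:11434/api/chat -d '{
  "model": "richardyoung/schematron-3b:Q4_K_M",
  "messages": [{"role": "user", "content": "Summarize this text in 5 bullets: ..."}],
  "stream": false
}'
```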
## Example Use Cases
### Long document Q&A
```bash
ollama run richardyoung/schematron-3b:Q4_K_M "Read this and answer questions:\n\n[paste doc here]"
```
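For real documents, substituting a file's contents is usually more practical than pasting. A sketch using shell command substitution, where `doc.txt` is a hypothetical plain-text file:

```bash
# Sketch: substitute a local file into the prompt via command substitution.
# doc.txt is a hypothetical plain-text file; keep it well inside the context window.
ollama run richardyoung/schematron-3b:Q4_K_M "Read this and answer questions:

$(cat doc.txt)"
```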
### Planning and analysis
```bash
ollama run richardyoung/schematron-3b:Q4_K_M "Create a step-by-step plan to refactor this module:\n\n[paste code here]"
```
### Structured outputs
```bash
ollama run richardyoung/schematron-3b:Q4_K_M --format json "Return a JSON object with keys {title, summary, risks} for:\n\n[paste text here]"
```
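The same JSON-constrained decoding is available over the REST API via `"format": "json"`. A sketch, assuming the default local endpoint; the keys `{title, summary, risks}` simply mirror the prompt above:

```bash
# Sketch: JSON-constrained output via the REST API instead of the CLI flag.
curl http://localhost:11434/api/generate -d '{
  "model": "richardyoung/schematron-3b:Q4_K_M",
  "format": "json",
  "stream": false,
  "prompt": "Return a JSON object with keys {title, summary, risks} for: ..."
}'
```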
## Prompt Format / Templates
This model is packaged with a Llama 3-style chat template (special tokens like `<|start_header_id|>` / `<|eot_id|>`). If you create your own Ollama Modelfile, match this template; a sketch follows below.
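A minimal Modelfile sketch. The template body below is the common Llama 3 chat pattern, not copied from this repo's metadata, and the `num_ctx` value is illustrative; verify the packaged template with `ollama show richardyoung/schematron-3b:Q4_K_M --modelfile` before relying on it:

```bash
# Sketch: derive a custom Ollama model with a Llama 3-style template.
# Template body is the standard Llama 3 pattern (an assumption, not pulled
# from this repo); check the real one with `ollama show ... --modelfile`.
cat > Modelfile <<'EOF'
FROM richardyoung/schematron-3b:Q4_K_M
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
PARAMETER num_ctx 8192
EOF
ollama create schematron-custom -f Modelfile
```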
Suggested sampling defaults are carried in the GGUF metadata; inspect them with `ollama show richardyoung/schematron-3b:Q4_K_M`.
## System Requirements (Practical)
### Weights (minimum)
The quantized weights are ~1.8–2.0 GB on disk depending on the tag (see “Available Versions”); expect roughly 3–4 GB of RAM to load and run them, before the KV cache.
### Context window (the real memory driver)
KV cache memory grows roughly linearly with `num_ctx`. Even on a 3B model, very large contexts can require tens of GB of additional RAM. If you don’t need extreme context lengths, keep `num_ctx` modest (e.g. 8K–32K) for a much lower RAM footprint.
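To put rough numbers on that, here is a back-of-the-envelope estimate. The shape figures below are the published Llama 3.2 3B config (28 layers, 8 KV heads, head dim 128) and are assumptions about this particular build, as is the fp16 cache type:

```bash
# Back-of-the-envelope KV-cache size for a given num_ctx.
# Shape figures assume the published Llama 3.2 3B config:
# 28 layers, 8 KV heads, head_dim 128, fp16 (2-byte) cache entries.
n_layers=28; n_kv_heads=8; head_dim=128; bytes_per_elem=2
num_ctx=32768
# K and V each store n_layers * n_kv_heads * head_dim elements per token.
kv_bytes=$(( 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * num_ctx ))
echo "KV cache at num_ctx=$num_ctx: $(( kv_bytes / 1024 / 1024 )) MiB"   # ~7168 MiB
```

Under those assumptions the cache costs about 224 KiB per token: roughly 1.75 GiB at 8K context and ~7 GiB at 32K, which is why keeping `num_ctx` modest matters even on a 3B model.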
## License
This model is governed by the upstream licensing and terms of use.
The GGUF quantizations are derivative artifacts; you must comply with all upstream terms before redistribution or commercial use.
## Acknowledgments
Thanks to inference-net for the upstream Schematron-3B model, and to the llama.cpp and Ollama projects for the GGUF tooling and packaging.