
Specialized models for extracting structured data from HTML


f500e7a023f8 · 6.4GB

llama · 3.21B · F16

Readme

Schematron is a fine-tuned model for HTML to JSON extraction built on Llama 3.2–3B. Give it a JSON Schema and a page’s HTML; it returns structured data. Designed for reliable, schema-valid outputs with temperature 0.

IMPORTANT — Call Pattern

Our Modelfile template expects two user messages:

  1. Message 0 = the JSON Schema (as text)
  2. Message 1 = the cleaned HTML (as text)

If you don’t send both, your request will fail or return empty or malformed output.
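The two-message contract can be enforced before any network call with a small helper (a sketch; `build_messages` is our own name, not part of the Ollama SDK):

```python
import json

def build_messages(schema: dict, cleaned_html: str) -> list:
    """Build the payload Schematron expects:
    message 0 = JSON Schema (as text), message 1 = cleaned HTML (as text)."""
    if not cleaned_html or not cleaned_html.strip():
        raise ValueError("cleaned_html is empty; both messages are required")
    return [
        {"role": "user", "content": json.dumps(schema, ensure_ascii=False)},
        {"role": "user", "content": cleaned_html},
    ]
```

Pass the result directly as `messages` in any of the examples below.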


Defaults

  • Temperature: 0
  • Input expectation: pass both schema and cleaned HTML (see below)

Quickstart (cURL)

curl -s http://localhost:11434/api/chat \
  -H "content-type: application/json" \
  -d '{
    "model": "Inference/Schematron:3B",
    "stream": false,
    "messages": [
      { "role": "user", "content": "{ \"type\": \"object\", \"properties\": { \"title\": {\"type\":\"string\"}, \"date\": {\"type\":\"string\"}, \"price\": {\"type\":\"number\"}, \"tags\": {\"type\":\"array\",\"items\":{\"type\":\"string\"}} }, \"required\": [\"title\"] }" },
      { "role": "user", "content": "<!doctype html><html><head><title>ModelFest — Oct 5, 2025</title></head><body><h1>ModelFest</h1><p>Date: Oct 5, 2025</p><p>Tickets from $129.99</p><ul><li>ai</li><li>conference</li></ul></body></html>" }
    ],
    "format": {
      "type": "object",
      "properties": {
        "title": {"type":"string"},
        "date": {"type":"string"},
        "price": {"type":"number"},
        "tags": {"type":"array","items":{"type":"string"}}
      },
      "required": ["title"]
    },
    "options": { "temperature": 0 }
  }'

Response (example)

{
  "title": "ModelFest",
  "date": "Oct 5, 2025",
  "price": 129.99,
  "tags": ["ai", "conference"]
}
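The model returns this as a JSON string in `message.content`, so parse it before use:

```python
import json

# The raw string the server returned for the Quickstart request above
raw = '{"title": "ModelFest", "date": "Oct 5, 2025", "price": 129.99, "tags": ["ai", "conference"]}'
event = json.loads(raw)
print(event["title"], event["price"])  # ModelFest 129.99
```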

Python (with structured outputs)

from ollama import chat
import json

# Response schema you want BACK (recommended)
OutputSchema = {
  "type": "object",
  "properties": {
    "title": {"type":"string"},
    "date": {"type":"string"},
    "price": {"type":"number"},
    "tags": {"type":"array","items":{"type":"string"}}
  },
  "required": ["title"]
}

schema_text_for_prompt = json.dumps(OutputSchema, ensure_ascii=False)

html = """
<!doctype html><html><head><title>ModelFest — Oct 5, 2025</title></head>
<body><h1>ModelFest</h1><p>Date: Oct 5, 2025</p><p>Tickets from $129.99</p>
<ul><li>ai</li><li>conference</li></ul></body></html>
"""

resp = chat(
  model="Inference/Schematron:3B",
  stream=False,
  messages=[
    {"role": "user", "content": schema_text_for_prompt},  # Message 0 = SCHEMA
    {"role": "user", "content": html}                     # Message 1 = HTML
  ],
  format=OutputSchema,        # constrain decoding to the response schema
  options={"temperature": 0}
)

data = json.loads(resp.message.content)  # dict conforming to OutputSchema
print(data)

JavaScript (Node)

import ollama from 'ollama';

const OutputSchema = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    date: { type: 'string' },
    price: { type: 'number' },
    tags: { type: 'array', items: { type: 'string' } },
  },
  required: ['title'],
};

const schemaTextForPrompt = JSON.stringify(OutputSchema);
const html = `
<!doctype html><html><head><title>ModelFest — Oct 5, 2025</title></head>
<body><h1>ModelFest</h1><p>Date: Oct 5, 2025</p><p>Tickets from $129.99</p>
<ul><li>ai</li><li>conference</li></ul></body></html>
`;

const resp = await ollama.chat({
  model: 'Inference/Schematron:3B',
  stream: false,
  messages: [
    { role: 'user', content: schemaTextForPrompt }, // Message 0 = SCHEMA
    { role: 'user', content: html },                // Message 1 = HTML
  ],
  format: OutputSchema, // constrain decoding to the response schema
  options: { temperature: 0 },
});

const data = JSON.parse(resp.message.content);
console.log(data);

Clean your HTML (recommended)

We recommend cleaning your HTML before sending it to the model. The training data was cleaned with lxml, but any library should work, as long as you don’t clean so aggressively that you lose information relevant to your extraction.

# Note: lxml >= 5.2 moved this module to the separate lxml_html_clean package.
from lxml.html.clean import Cleaner
import lxml.html as LH

HTML_CLEANER = Cleaner(
    scripts=True, javascript=True, style=True, inline_style=True, safe_attrs_only=False
)

def strip_noise(raw_html: str) -> str:
    if not raw_html or not raw_html.strip():
        return ""
    try:
        doc = LH.fromstring(raw_html)
        cleaned = HTML_CLEANER.clean_html(doc)
        return LH.tostring(cleaned, encoding="unicode")
    except Exception:
        return ""

Use strip_noise(html) before sending Message 1.
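If you’d rather avoid the lxml dependency, here is a rougher stdlib-only sketch (`strip_noise_stdlib` is our own helper, not from any library); it drops script/style subtrees, inline styles, and on* event handlers:

```python
from html.parser import HTMLParser

class _NoiseStripper(HTMLParser):
    """Re-emit HTML while skipping <script>/<style> subtrees and
    dropping style attributes and on* event handlers."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
            return
        if self._skip_depth == 0:
            kept = [(k, v) for k, v in attrs
                    if not k.startswith("on") and k != "style"]
            attr_str = "".join(
                f' {k}' if v is None else f' {k}="{v}"' for k, v in kept
            )
            self.out.append(f"<{tag}{attr_str}>")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if self._skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.out.append(data)

def strip_noise_stdlib(raw_html: str) -> str:
    if not raw_html or not raw_html.strip():
        return ""
    parser = _NoiseStripper()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.out)
```

This keeps text and structural tags intact, which is usually enough signal for extraction.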


Output schema (example)

If you don’t already have a target schema, here’s a starter:

{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "date": { "type": "string" },
    "price": { "type": "number" },
    "tags": { "type": "array", "items": { "type": "string" } }
  },
  "required": ["title"]
}

Troubleshooting

  • HTTP 500 / “index of untyped nil”: send both user messages (schema first, cleaned HTML second).
  • Noisy HTML: clean with lxml or a library of your choice (see snippet above).
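Even with constrained decoding, a cheap post-check catches truncated or malformed responses before they reach your pipeline (a sketch; `check_required` is our own helper):

```python
import json

def check_required(schema: dict, payload: str) -> dict:
    """Parse the model's JSON string and verify the schema's required keys."""
    data = json.loads(payload)  # raises ValueError on malformed JSON
    missing = [k for k in schema.get("required", []) if k not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data
```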

License

Llama 3.2 license (downstream usage subject to base model terms).


Pull & run

ollama pull Inference/Schematron:3B
ollama run Inference/Schematron:3B