Specialized models for extracting structured data from HTML

Schematron is a fine-tuned model for HTML-to-JSON extraction built on Llama 3.2 3B. Give it a JSON Schema and a page’s HTML; it returns structured data. Designed for reliable, schema-valid outputs at temperature 0.

IMPORTANT — Local vs. Serverless Usage

When running locally (Ollama): Use the specific prompt format shown below for best results.

When using the serverless API: No prompt formatting needed—just pass your HTML and schema; we handle it for you.

Learn more:

  • Schematron Announcement Blog
  • Serverless API


Defaults (Local Usage)

  • Temperature: 0
  • Prompt format: Use the structured prompt template shown in examples below
  • Messages: System message + single user message with schema and HTML combined

Quickstart (cURL)

curl -s http://localhost:11434/api/chat \
  -H "content-type: application/json" \
  -d '{
    "model": "Inference/Schematron:3B",
    "stream": false,
    "messages": [
      { "role": "system", "content": "You are a helpful assistant" },
      { "role": "user", "content": "You are going to be given a JSON schema following the standardized JSON Schema format. You are going to be given a HTML page and you are going to apply the schema to the HTML page however you see it as applicable and return the results in a JSON object. The schema is as follows:\n\n{ \"type\": \"object\", \"properties\": { \"title\": {\"type\":\"string\"}, \"date\": {\"type\":\"string\"}, \"price\": {\"type\":\"number\"}, \"tags\": {\"type\":\"array\",\"items\":{\"type\":\"string\"}} }, \"required\": [\"title\"] }\n\nHere is the HTML page:\n\n<!doctype html><html><head><title>ModelFest — Oct 5, 2025</title></head><body><h1>ModelFest</h1><p>Date: Oct 5, 2025</p><p>Tickets from $129.99</p><ul><li>ai</li><li>conference</li></ul></body></html>\n\nMAKE SURE ITS VALID JSON." }
    ],
    "format": {
      "type": "object",
      "properties": {
        "title": {"type":"string"},
        "date": {"type":"string"},
        "price": {"type":"number"},
        "tags": {"type":"array","items":{"type":"string"}}
      },
      "required": ["title"]
    },
    "options": { "temperature": 0 }
  }'

Response (example)

{
  "title": "ModelFest",
  "date": "Oct 5, 2025",
  "price": 129.99,
  "tags": ["ai", "conference"]
}

Python (with structured outputs)

from ollama import chat
import json

# Response schema you want BACK (recommended)
OutputSchema = {
  "type": "object",
  "properties": {
    "title": {"type":"string"},
    "date": {"type":"string"},
    "price": {"type":"number"},
    "tags": {"type":"array","items":{"type":"string"}}
  },
  "required": ["title"]
}

schema_text = json.dumps(OutputSchema, ensure_ascii=False)

html = """
<!doctype html><html><head><title>ModelFest — Oct 5, 2025</title></head>
<body><h1>ModelFest</h1><p>Date: Oct 5, 2025</p><p>Tickets from $129.99</p>
<ul><li>ai</li><li>conference</li></ul></body></html>
"""

# Construct the prompt
user_prompt = (
    "You are going to be given a JSON schema following the standardized JSON Schema format. "
    "You are going to be given a HTML page and you are going to apply the schema to the HTML "
    "page however you see it as applicable and return the results in a JSON object. "
    "The schema is as follows:\n\n"
    f"{schema_text}\n\n"
    "Here is the HTML page:\n\n"
    f"{html}\n\n"
    "MAKE SURE ITS VALID JSON."
)

resp = chat(
  model="Inference/Schematron:3B",
  stream=False,
  messages=[
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": user_prompt}
  ],
  format=OutputSchema,  # constrain decoding to the schema (structured outputs)
  options={"temperature": 0}
)

print(resp.message.content)       # JSON string conforming to OutputSchema
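Once the reply comes back, it is worth parsing and sanity-checking the JSON before using it downstream. A minimal stdlib-only sketch (the `check_required` helper is illustrative, not part of any API, and the `raw` string stands in for `resp.message.content`):

```python
import json

def check_required(obj: dict, schema: dict) -> list:
    """Return the names of required schema properties missing from obj."""
    return [k for k in schema.get("required", []) if k not in obj]

OutputSchema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "date": {"type": "string"},
        "price": {"type": "number"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title"],
}

# In practice this string comes from resp.message.content.
raw = '{"title": "ModelFest", "date": "Oct 5, 2025", "price": 129.99, "tags": ["ai", "conference"]}'

data = json.loads(raw)  # raises ValueError on malformed JSON
missing = check_required(data, OutputSchema)
if missing:
    raise ValueError(f"missing required fields: {missing}")
print(data["title"])  # ModelFest
```

For stricter validation (types, nested objects) a full validator such as the `jsonschema` package is the usual choice; the helper above only checks top-level required keys.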

JavaScript (Node)

import ollama from 'ollama';

const OutputSchema = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    date: { type: 'string' },
    price: { type: 'number' },
    tags: { type: 'array', items: { type: 'string' } },
  },
  required: ['title'],
};

const schemaText = JSON.stringify(OutputSchema);
const html = `
<!doctype html><html><head><title>ModelFest — Oct 5, 2025</title></head>
<body><h1>ModelFest</h1><p>Date: Oct 5, 2025</p><p>Tickets from $129.99</p>
<ul><li>ai</li><li>conference</li></ul></body></html>
`;

// Construct the prompt
const userPrompt = 
  `You are going to be given a JSON schema following the standardized JSON Schema format. ` +
  `You are going to be given a HTML page and you are going to apply the schema to the HTML ` +
  `page however you see it as applicable and return the results in a JSON object. ` +
  `The schema is as follows:\n\n${schemaText}\n\n` +
  `Here is the HTML page:\n\n${html}\n\n` +
  `MAKE SURE ITS VALID JSON.`;

const resp = await ollama.chat({
  model: 'Inference/Schematron:3B',
  stream: false,
  messages: [
    { role: 'system', content: 'You are a helpful assistant' },
    { role: 'user', content: userPrompt },
  ],
  format: OutputSchema, // constrain decoding to the schema (structured outputs)
  options: { temperature: 0 },
});

console.log(resp.message.content);

Clean your HTML (recommended)

We recommend cleaning your HTML before sending it to the model. The training data was cleaned with lxml, but any library should work, as long as you don’t clean so aggressively that you lose the information relevant to your extraction.

# Note: as of lxml 5.2, lxml.html.clean lives in the separate
# lxml_html_clean package (pip install lxml_html_clean).
from lxml.html.clean import Cleaner
import lxml.html as LH

HTML_CLEANER = Cleaner(
    scripts=True, javascript=True, style=True, inline_style=True, safe_attrs_only=False
)

def strip_noise(raw_html: str) -> str:
    if not raw_html or not raw_html.strip():
        return ""
    try:
        doc = LH.fromstring(raw_html)
        cleaned = HTML_CLEANER.clean_html(doc)
        return LH.tostring(cleaned, encoding="unicode")
    except Exception:
        return ""

Use strip_noise(html) before including the HTML in your prompt.
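If you would rather avoid the lxml dependency, a rough stdlib-only fallback can drop just `<script>` and `<style>` elements. This is a sketch, not equivalent to the lxml Cleaner: comments, doctype declarations, and valueless attributes are dropped too, which is usually acceptable for extraction.

```python
from html.parser import HTMLParser

class _NoiseStripper(HTMLParser):
    """Drop <script> and <style> elements; re-emit everything else."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.out = []
        self.depth = 0  # nesting depth inside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1
        elif self.depth == 0:
            a = "".join(f' {k}="{v}"' for k, v in attrs if v is not None)
            self.out.append(f"<{tag}{a}>")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

def strip_noise_stdlib(raw_html: str) -> str:
    parser = _NoiseStripper()
    parser.feed(raw_html)
    return "".join(parser.out)
```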


Output schema (example)

If you don’t already have a target schema, here’s a starter:

{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "date": { "type": "string" },
    "price": { "type": "number" },
    "tags": { "type": "array", "items": { "type": "string" } }
  },
  "required": ["title"]
}

Troubleshooting

  • HTTP 500 / poor extraction quality: Make sure you use the prompt format shown in the examples above (system message plus a single structured user prompt).
  • HTML noise: Clean the HTML with lxml or a library of your choice (see snippet above). Very long pages can also degrade results; if yours is too long, clean more aggressively or truncate it.
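The truncation step can be sketched as below; `max_chars` is an arbitrary illustrative budget, not a documented model limit, so tune it to your context window. Cutting at a tag boundary avoids ending mid-tag, though the tail may still contain unclosed elements.

```python
def truncate_html(html: str, max_chars: int = 60_000) -> str:
    """Crude length guard: keep the head of the page within a character budget.

    Cuts at the last complete tag before the budget so we never end
    mid-tag; unclosed elements in the tail are tolerated.
    """
    if len(html) <= max_chars:
        return html
    cut = html.rfind(">", 0, max_chars)
    return html[: cut + 1] if cut != -1 else html[:max_chars]
```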

License

Llama 3.2 license (downstream usage subject to base model terms).


Pull & run

ollama pull Inference/Schematron:3B
ollama run Inference/Schematron:3B