Schematron is a fine-tuned model for HTML to JSON extraction built on Llama 3.2–3B. Give it a JSON Schema and a page’s HTML; it returns structured data. Designed for reliable, schema-valid outputs with temperature 0.
IMPORTANT — Local vs. Serverless Usage
When running locally (Ollama): Use the specific prompt format shown below for best results.
When using the serverless API: No prompt formatting needed—just pass your HTML and schema; we handle it for you.
Learn More:
- Schematron Announcement Blog
- Serverless API
curl -s http://localhost:11434/api/chat \
-H "content-type: application/json" \
-d '{
"model": "Inference/Schematron:3B",
"stream": false,
"messages": [
{ "role": "system", "content": "You are a helpful assistant" },
{ "role": "user", "content": "You are going to be given a JSON schema following the standardized JSON Schema format. You are going to be given a HTML page and you are going to apply the schema to the HTML page however you see it as applicable and return the results in a JSON object. The schema is as follows:\n\n{ \"type\": \"object\", \"properties\": { \"title\": {\"type\":\"string\"}, \"date\": {\"type\":\"string\"}, \"price\": {\"type\":\"number\"}, \"tags\": {\"type\":\"array\",\"items\":{\"type\":\"string\"}} }, \"required\": [\"title\"] }\n\nHere is the HTML page:\n\n<!doctype html><html><head><title>ModelFest — Oct 5, 2025</title></head><body><h1>ModelFest</h1><p>Date: Oct 5, 2025</p><p>Tickets from $129.99</p><ul><li>ai</li><li>conference</li></ul></body></html>\n\nMAKE SURE ITS VALID JSON." }
],
"format": {
"type": "object",
"properties": {
"title": {"type":"string"},
"date": {"type":"string"},
"price": {"type":"number"},
"tags": {"type":"array","items":{"type":"string"}}
},
"required": ["title"]
},
"options": { "temperature": 0 }
}'
Response (example)
{
"title": "ModelFest",
"date": "Oct 5, 2025",
"price": 129.99,
"tags": ["ai", "conference"]
}
from ollama import chat
import json
# Response schema you want BACK (recommended)
OutputSchema = {
"type": "object",
"properties": {
"title": {"type":"string"},
"date": {"type":"string"},
"price": {"type":"number"},
"tags": {"type":"array","items":{"type":"string"}}
},
"required": ["title"]
}
schema_text = json.dumps(OutputSchema, ensure_ascii=False)
html = """
<!doctype html><html><head><title>ModelFest — Oct 5, 2025</title></head>
<body><h1>ModelFest</h1><p>Date: Oct 5, 2025</p><p>Tickets from $129.99</p>
<ul><li>ai</li><li>conference</li></ul></body></html>
"""
# Construct the prompt
user_prompt = (
"You are going to be given a JSON schema following the standardized JSON Schema format. "
"You are going to be given a HTML page and you are going to apply the schema to the HTML "
"page however you see it as applicable and return the results in a JSON object. "
"The schema is as follows:\n\n"
f"{schema_text}\n\n"
"Here is the HTML page:\n\n"
f"{html}\n\n"
"MAKE SURE ITS VALID JSON."
)
resp = chat(
model="Inference/Schematron:3B",
stream=False,
messages=[
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": user_prompt}
],
options={"temperature": 0}
)
print(resp.message.content) # JSON string conforming to OutputSchema
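Because the model returns a JSON string, it is worth parsing it and sanity-checking the result before using it downstream. Here is a minimal, standard-library-only sketch that checks the schema's required properties against the sample response above (for full JSON Schema validation you might prefer a package like jsonschema; `check_required` is an illustrative helper, not part of the model's API):

```python
import json

def check_required(data: dict, schema: dict) -> list:
    """Return the names of required properties missing from the extracted data."""
    return [key for key in schema.get("required", []) if key not in data]

# Example model output, matching the response shown earlier
raw = '{"title": "ModelFest", "date": "Oct 5, 2025", "price": 129.99, "tags": ["ai", "conference"]}'

schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}},
    "required": ["title"],
}

data = json.loads(raw)  # raises json.JSONDecodeError if the output is not valid JSON
missing = check_required(data, schema)
print(missing)  # []
```

With temperature 0 and the `format` field set, invalid JSON should be rare, but catching a parse error is cheaper than debugging a silent downstream failure.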
import ollama from 'ollama';
const OutputSchema = {
type: 'object',
properties: {
title: { type: 'string' },
date: { type: 'string' },
price: { type: 'number' },
tags: { type: 'array', items: { type: 'string' } },
},
required: ['title'],
};
const schemaText = JSON.stringify(OutputSchema);
const html = `
<!doctype html><html><head><title>ModelFest — Oct 5, 2025</title></head>
<body><h1>ModelFest</h1><p>Date: Oct 5, 2025</p><p>Tickets from $129.99</p>
<ul><li>ai</li><li>conference</li></ul></body></html>
`;
// Construct the prompt
const userPrompt =
`You are going to be given a JSON schema following the standardized JSON Schema format. ` +
`You are going to be given a HTML page and you are going to apply the schema to the HTML ` +
`page however you see it as applicable and return the results in a JSON object. ` +
`The schema is as follows:\n\n${schemaText}\n\n` +
`Here is the HTML page:\n\n${html}\n\n` +
`MAKE SURE ITS VALID JSON.`;
const resp = await ollama.chat({
model: 'Inference/Schematron:3B',
stream: false,
messages: [
{ role: 'system', content: 'You are a helpful assistant' },
{ role: 'user', content: userPrompt },
],
options: { temperature: 0 },
});
console.log(resp.message.content);
We recommend cleaning your HTML before submitting it to the model. The training data was cleaned with lxml, but any library should work, as long as you don't clean so aggressively that you lose the information relevant to your extraction.
# Note: in lxml >= 5.2 the Cleaner lives in the separate lxml_html_clean package;
# install it with: pip install "lxml[html_clean]"
from lxml.html.clean import Cleaner
import lxml.html as LH

HTML_CLEANER = Cleaner(
    scripts=True, javascript=True, style=True, inline_style=True, safe_attrs_only=False
)
def strip_noise(raw_html: str) -> str:
if not raw_html or not raw_html.strip():
return ""
try:
doc = LH.fromstring(raw_html)
cleaned = HTML_CLEANER.clean_html(doc)
return LH.tostring(cleaned, encoding="unicode")
except Exception:
return ""
Use strip_noise(html) before including the HTML in your prompt.
If you don’t already have a target schema, here’s a starter:
{
"type": "object",
"properties": {
"title": { "type": "string" },
"date": { "type": "string" },
"price": { "type": "number" },
"tags": { "type": "array", "items": { "type": "string" } }
},
"required": ["title"]
}
Cleaning: lxml or a library of your choice (see snippet above).
Limitations: You may see issues if your page is too long; if it is, it's recommended to clean more aggressively or to truncate the page.
License: Llama 3.2 license (downstream usage subject to base model terms).
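If a cleaned page is still too long, a simple character budget is one way to truncate it. This is an illustrative sketch; the 100,000-character default is an arbitrary placeholder, not a documented model limit:

```python
def truncate_html(cleaned_html: str, max_chars: int = 100_000) -> str:
    """Keep the page under a character budget, cutting at a tag boundary when possible."""
    if len(cleaned_html) <= max_chars:
        return cleaned_html
    cut = cleaned_html.rfind(">", 0, max_chars)  # avoid splitting mid-tag
    if cut == -1:
        return cleaned_html[:max_chars]
    return cleaned_html[: cut + 1]
```

Truncating from the end tends to work well on article-style pages, where boilerplate (footers, related links) sits at the bottom; adjust if the fields you need live there.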
Pull & run:
ollama pull Inference/Schematron:3B
ollama run Inference/Schematron:3B