Prometheus 2 (7B, Q4_K_M)

An LLM-as-judge model fine-tuned for absolute grading (1–5 score on a rubric) and pairwise ranking of generative outputs. Built from the official prometheus-eval/prometheus-7b-v2.0 weights (Apache-2.0) with the paper’s ABS_SYSTEM_PROMPT and the Mistral Instruct chat template baked in, so you can score a (question, response, reference) triple with a single user prompt.

Built-in defaults

Setting       Value
SYSTEM        "You are a fair judge assistant tasked with providing clear, objective feedback based on specific criteria, ensuring each assessment reflects the absolute standards set for performance."
TEMPLATE      Mistral Instruct: [INST] {system}\n\n{prompt} [/INST]
temperature   0 (deterministic judging)
num_ctx       8192
stop          [INST], [/INST]
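
These defaults are baked into the Modelfile (reproduced below), but Ollama's generate endpoint also accepts per-request overrides through its options field. A minimal sketch using the requests library; absolute_prompt is a placeholder for a prompt built as described in the next section:

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "ggozad/prometheus2",
        "prompt": absolute_prompt,  # placeholder: build per the format below
        "stream": False,
        # Per-request overrides of the baked-in defaults:
        "options": {"temperature": 0, "num_ctx": 8192},
    },
)
print(resp.json()["response"])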

Prompt format (absolute grading)

The model expects this user-message structure verbatim — Prometheus 2 was trained on it and degrades with other layouts:

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "(write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
{rubric}

###Feedback:

The {rubric} is a 5-point Likert description in the form:

[<one-sentence high-level criterion>]
Score 1: <description>
Score 2: <description>
Score 3: <description>
Score 4: <description>
Score 5: <description>
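
To assemble the user message programmatically, splice the four fields into the fixed layout above. A minimal sketch, where ABS_TASK_TEMPLATE and build_absolute_prompt are illustrative names; the template wording is kept verbatim, grammatical quirks included, because the model was trained on it:

# Verbatim absolute-grading layout from the Prometheus 2 paper.
ABS_TASK_TEMPLATE = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "(write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
{rubric}

###Feedback:"""

def build_absolute_prompt(instruction: str, response: str,
                          reference_answer: str, rubric: str) -> str:
    # Fill the four slots; the baked-in SYSTEM and chat template do the rest.
    return ABS_TASK_TEMPLATE.format(
        instruction=instruction,
        response=response,
        reference_answer=reference_answer,
        rubric=rubric,
    )

Pass the result as the plain user prompt; no extra system message or chat formatting is needed.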

Parsing the output

The model returns free-form feedback ending in [RESULT] N. Extract with a regex:

import re
match = re.search(r"\[RESULT\]\s*([1-5])", output)
score = int(match.group(1)) if match else None

To collapse to a binary pass/fail, a score ≥ 4 is the conventional cutoff (see the helper sketch below).
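
In a pipeline it may help to wrap both steps in small helpers. A minimal sketch; parse_score and passed are illustrative names, not part of any library:

import re

def parse_score(output: str) -> int | None:
    # Pull the 1-5 integer that follows the [RESULT] tag; None if missing.
    match = re.search(r"\[RESULT\]\s*([1-5])", output)
    return int(match.group(1)) if match else None

def passed(output: str, threshold: int = 4) -> bool:
    # Collapse the rubric score to pass/fail at the conventional >= 4 cutoff.
    score = parse_score(output)
    return score is not None and score >= threshold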

Quick start

ollama run ggozad/prometheus2

Or via the API:

curl http://localhost:11434/api/generate -d '{
  "model": "ggozad/prometheus2",
  "prompt": "###Task Description:\nAn instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.\n...\n###Feedback:",
  "stream": false
}'

For Python integration, point any OpenAI-compatible client at http://localhost:11434/v1:

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.ollama import OllamaProvider

agent = Agent(
    OpenAIChatModel(
        model_name="ggozad/prometheus2",
        provider=OllamaProvider(base_url="http://localhost:11434/v1"),
    ),
    output_type=str,
)
result = await agent.run(absolute_prompt)  # inside async code; use agent.run_sync() otherwise
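
The run result carries the raw feedback text ending in [RESULT] N; feed it to the parser above. A minimal continuation, assuming result.output as in recent pydantic_ai releases (older ones expose result.data):

feedback = result.output       # raw judge text, "...[RESULT] N"
score = parse_score(feedback)  # helper sketched under "Parsing the output"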

When to use this model

  • ✅ Calibrated rubric-based grading where you want a 1–5 score with feedback
  • ✅ Pairwise comparison of two LLM outputs against a reference
  • ✅ Local / on-prem evaluation pipelines (no third-party API)
  • ⚠️ Binary equivalence judging — works, but a generic instruction-tuned LLM with a tailored binary rubric may agree more closely with human labels for narrow yes/no decisions
  • ❌ General chat or open-ended generation — this is a judge, not an assistant

Modelfile

FROM <prometheus-7b-v2.0.Q4_K_M.gguf>

TEMPLATE """[INST] {{ if .System }}{{ .System }}

{{ end }}{{ .Prompt }} [/INST] """

SYSTEM """You are a fair judge assistant tasked with providing clear, objective feedback based on specific criteria, ensuring each assessment reflects the absolute standards set for performance."""

PARAMETER temperature 0
PARAMETER num_ctx 8192
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
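
To reproduce the build locally, replace the FROM placeholder with the path to a Q4_K_M GGUF of prometheus-7b-v2.0 and run something like ollama create prometheus2 -f Modelfile (the model name is your choice).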

License & attribution

Released under Apache-2.0, the same license as the upstream weights. If you redistribute or fine-tune further, retain attribution to:

  • Kim et al., Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (2024)
  • The prometheus-eval team, who published the original weights