The IBM Granite Guardian 3.3 models are designed to detect risks in prompts and/or responses. They can help with risk detection along many key dimensions catalogued in the IBM AI Risk Atlas. They are trained on unique data comprising human annotations and synthetic data informed by internal red-teaming, and they outperform other open-source models in the same space on standard benchmarks.
Granite Guardian 3.3 8B is a specialized Granite 3.3 8B model designed to judge whether the input prompts and output responses of an LLM-based system meet specified criteria. The model comes pre-baked with certain criteria, including but not limited to: jailbreak attempts, profanity, and hallucinations related to tool calls and retrieval-augmented generation in agent-based systems. The model also allows users to bring their own criteria and tailor the judging behavior to specific use cases.
This version of Granite Guardian is a hybrid thinking model that allows the user to operate in thinking or non-thinking mode. In thinking mode, the model produces detailed reasoning traces inside <think> and </think> tags. In non-thinking mode, the model only produces the judgement score inside the <score> and </score> tags.
ollama run ibm/granite3.3-guardian:8b
>>> /set system violence
Risk detection in prompt text or model response (i.e. as guardrails), such as:
- harm: content considered generally harmful
- social_bias: prejudice based on identity or characteristics
- jailbreak: deliberate instances of manipulating AI to generate harmful, undesired, or inappropriate content
- violence: content promoting physical, mental, or sexual harm
- profanity: use of offensive language or insults
- sexual_content: explicit or suggestive material of a sexual nature
- unethical_behavior: actions that violate moral or legal standards

RAG (retrieval-augmented generation) to assess:
- relevance: whether the retrieved context is relevant to the query
- groundedness: whether the response is accurate and faithful to the provided context
- answer_relevance: whether the response directly addresses the user's query (see the sketch after this list)

Agentic Workflows to assess:
- function_calling: validates use of function calls for syntactic and semantic hallucination
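For the RAG and response-level criteria, the item being judged is the assistant message rather than the user prompt. The sketch below, using the Ollama Python client described in the next section, checks answer_relevance; it assumes the criterion is selected through the system message (just as the CLI examples further down do with /set system) and that the model judges the last assistant turn.

import ollama

# Hedged sketch: judge whether an assistant reply actually answers the user's question.
# Assumes the answer_relevance criterion is selected via the system message and that
# the last assistant message is the one being assessed.
response = ollama.chat(
    model="ibm/granite3.3-guardian:8b",
    messages=[
        {"role": "system", "content": "answer_relevance"},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Croissants are a popular French pastry."},
    ],
    options={"temperature": 0},  # keep temperature at 0 for reliable scoring
)

print(response.message.content)  # expected to contain a verdict such as <score> yes </score>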
To enable "thinking" for this model, follow the tooling-specific instructions below.
From the Python chat client, which is part of the Ollama Python library (https://github.com/ollama/ollama-python), use the think keyword argument with the value True:
import ollama

model = "ibm/granite3.3-guardian:8b"
messages = [
    {
        "role": "user",
        "content": "hello world",
    },
]

# Note: for Guardian models, the "temperature" parameter must be set to zero (0)
# to ensure accurate assessment and scoring.
response = ollama.chat(
    model=model,
    think=True,
    messages=messages,
    options={"temperature": 0},
)

# The model's judgement is returned inside <score> tags in the message content.
print(response.message.content)
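The Python client has no equivalent of the interactive /set command, but /set system simply sets the session's system message, so the same criterion selection can be expressed as a system turn. The sketch below assumes a pre-baked criterion name such as harm is passed verbatim as the system message, and that the thinking field on the returned message is available (it is only present in recent ollama-python releases):

import ollama

# Sketch: select the "harm" criterion by sending it as the system message,
# mirroring `/set system harm` in the interactive CLI.
response = ollama.chat(
    model="ibm/granite3.3-guardian:8b",
    think=True,
    messages=[
        {"role": "system", "content": "harm"},
        {"role": "user", "content": "You suck"},
    ],
    options={"temperature": 0},
)

print(response.message.thinking)  # reasoning trace produced in thinking mode
print(response.message.content)   # judgement, e.g. <score> yes </score>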
From the Ollama CLI, use the optional boolean flag --think with the value set to true. In addition, the detection criterion can be selected with the /set system command once the client is started, using any of the supported values listed above (e.g., harm, violence, etc.):
$ ollama run ibm/granite3.3-guardian:8b --think=true
>>> /set system harm
Set system message.
>>> You suck
Thinking...
To determine whether the provided text "you suck" is universally harmful, I need to analyze it in the
context of potential harm across all contexts and audiences. The term "suck" is generally considered
derogatory and can be offensive or hurtful to individuals. However, the level of harm it poses can vary
depending on the context in which it is used. For instance, in a casual conversation among friends, it might
be seen as playful banter, while in a professional setting, it would be inappropriate and harmful.
...
...done thinking.
<score> yes </score>
where a yes value within the <score> delimiters means the last user message was considered "harmful".
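Because the judgement always arrives inside the <score> tags, it is easy to extract programmatically. The sketch below also illustrates the bring-your-own-criteria capability mentioned above by passing a free-form criterion description as the system message; that mechanism is an assumption here, so consult the model card for the exact custom-criteria format.

import re

import ollama

# Assumption: a free-form criterion description can be supplied as the system message.
custom_criterion = "The text presents unverified medical claims as established fact."

response = ollama.chat(
    model="ibm/granite3.3-guardian:8b",
    messages=[
        {"role": "system", "content": custom_criterion},
        {"role": "user", "content": "Drinking two litres of coffee a day cures migraines."},
    ],
    options={"temperature": 0},
)

# Pull the yes/no verdict out of the <score> tags shown in the transcript above.
match = re.search(r"<score>\s*(yes|no)\s*</score>", response.message.content or "")
verdict = match.group(1) if match else None
print(verdict)  # "yes" if the criterion is met (risk detected), "no" otherwise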