
Upload a Sentence and Check if it has Harmful Content

Run the 7B model with:

ollama run harishkumar56278/TextContentModerator


Cybersecurity Text Moderation Model

This repository contains a Cybersecurity Content Moderation AI model for text-based content analysis. The model detects and classifies harmful or inappropriate text and runs through Ollama.

Model Information

  • Base Model: wizardlm2
  • Task: Text moderation and classification
  • Integration: Designed for use with Ollama

Features

  • Classifies text into predefined categories of harmful content.
  • Provides confidence scores (0 to 1) for each classification.
  • Generates structured JSON output for seamless integration.
  • Ensures strict system behavior to focus only on moderation.

Moderation Categories

The model classifies text into the following categories:

  1. Hate Speech – Offensive, derogatory, or discriminatory language targeting race, religion, gender, ethnicity, disability, or other protected traits.
  2. Unparliamentary Language – Profanity, offensive slurs, or disrespectful speech violating acceptable decorum.
  3. Threats – Statements implying harm, violence, doxxing, or any form of intimidation.
  4. Suicidal Content – Mentions of self-harm, suicidal ideation, or encouragement of self-harm.
  5. Terrorism-Related Content – Support, promotion, planning, or justification of terrorist acts or extremist ideologies.
  6. Illegal Content – Discussions of unlawful activities such as fraud, identity theft, hacking, drug trafficking, or other crimes.
  7. Harassment – Cyberbullying, repeated targeting, intimidation, or abusive behavior towards individuals or groups.
  8. Misinformation – False, misleading, or manipulated content designed to deceive or mislead the public.
  9. Self-Harm Encouragement – Any content that promotes, glorifies, or normalizes self-harm or suicidal behavior.
  10. Sexual Exploitation & Child Safety Violations – Content that depicts, promotes, or facilitates child exploitation, non-consensual sexual acts, or abuse.
  11. Explicit & NSFW Content – Pornographic, sexual, or highly explicit material unsuitable for general audiences.
  12. Political Manipulation & Disinformation – Coordinated or deceptive attempts to influence public opinion, elections, or spread propaganda.
  13. Spam, Scams, & Fraud – Deceptive content intended for financial gain, including phishing, Ponzi schemes, and fraudulent offers.

If no harmful content is detected, the text is classified as Safe.
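As a sketch, the taxonomy above can be mirrored in code when post-processing the model's JSON output. The snake_case keys below are assumptions inferred from the example outputs later in this README (only `hate_speech` and `threats` appear there verbatim):

```python
# Hypothetical snake_case keys for the 13 moderation categories; only
# "hate_speech" and "threats" are confirmed by the example outputs in
# this README -- the rest are assumed spellings.
CATEGORIES = [
    "hate_speech",
    "unparliamentary_language",
    "threats",
    "suicidal_content",
    "terrorism_related_content",
    "illegal_content",
    "harassment",
    "misinformation",
    "self_harm_encouragement",
    "sexual_exploitation_child_safety",
    "explicit_nsfw_content",
    "political_manipulation_disinformation",
    "spam_scams_fraud",
]

def is_known_category(key: str) -> bool:
    """Return True if a classification key is one of the 13 categories."""
    return key in CATEGORIES
```

A consumer can use this list to reject unexpected keys before trusting the classification.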

Model Configuration

FROM wizardlm2

# Set model parameters
PARAMETER temperature 0.2      
PARAMETER num_ctx 4096         
PARAMETER top_p 0.8            
PARAMETER repeat_penalty 1.2   

# Adapter

The adapter is fine-tuned on custom data in GPT conversation format, for example:
{
  "conversations": [
    {
      "role": "system",
      "content": "You are a content moderation assistant. Detect and classify harmful messages."
    },
    {
      "role": "user",
      "content": "Moderate and classify this message:\n\n\"' Waaaaittt niggah holllll'up ✋ these niggahs come to town start wife'n my old sluts 𐐤𐐤𐐤𐐪𐐪 .\""
    },
    {
      "role": "assistant",
      "content": "⚠️ Offensive Language"
    }
  ]
}
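A record in this format can be sanity-checked before fine-tuning. A minimal sketch, using only the field names shown in the example above:

```python
import json

def validate_record(raw: str) -> list:
    """Parse one training record and return the ordered list of roles.

    Raises ValueError if the record deviates from the
    system / user / assistant layout shown above.
    """
    record = json.loads(raw)
    roles = [turn["role"] for turn in record["conversations"]]
    if roles != ["system", "user", "assistant"]:
        raise ValueError(f"unexpected role sequence: {roles}")
    return roles

# Abbreviated example record in the same shape as the one above.
raw = '''{"conversations": [
    {"role": "system", "content": "You are a content moderation assistant."},
    {"role": "user", "content": "Moderate and classify this message: ..."},
    {"role": "assistant", "content": "Offensive Language"}
]}'''
```

Running `validate_record` over the whole dataset catches malformed rows before they silently degrade the fine-tune.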

System Behavior

The model is configured to strictly analyze and classify text. It does not engage in discussions, explanations, or opinions.

  • If a user submits text, it is classified into one or more categories.
  • If no category applies, it is labeled as “Safe” with a “Not Harmful Content” verdict.
  • If the input format is incorrect, the model returns:
    
    {
    "error": "Invalid format. Provide content in quotes: \"Your text here\"."
    }
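The quoting rule above can also be enforced client-side before sending text to the model. A hedged sketch (the error payload mirrors the one shown above; the helper name is an assumption):

```python
import json
import re

def check_input_format(text: str):
    """Return None if text is wrapped in double quotes, else the error JSON."""
    if re.fullmatch(r'"[^"]+"', text.strip()):
        return None
    return json.dumps(
        {"error": 'Invalid format. Provide content in quotes: "Your text here".'}
    )
```

Validating locally avoids a round trip to the model for trivially malformed input.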
    

Example Outputs

Harmful Content

{
  "classification": {
    "hate_speech": {
      "confidence_score": 0.85,
      "justification": "Detected racial slurs targeting a community - 'nigga'"
    },
    "threats": {
      "confidence_score": 0.92,
      "justification": "Direct threat of violence detected - explaining why it was flagged."
    }
  },
  "max_confidence_category": "threats",
  "final_verdict": "Harmful Content",
  "safe_content": false
}
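When consuming this JSON programmatically, `max_confidence_category` can be recomputed from the `classification` map as a consistency check. A sketch, using only fields present in the example outputs:

```python
def max_confidence_category(response: dict):
    """Return the category with the highest confidence, or None if safe."""
    classification = response.get("classification", {})
    if not classification:
        return None
    return max(classification, key=lambda k: classification[k]["confidence_score"])

# The harmful-content example from above, reduced to the relevant fields.
harmful = {
    "classification": {
        "hate_speech": {"confidence_score": 0.85},
        "threats": {"confidence_score": 0.92},
    },
    "max_confidence_category": "threats",
    "safe_content": False,
}
```

If the recomputed category disagrees with the model's own `max_confidence_category`, the response is worth flagging for review.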

Safe Content

{
  "classification": {},
  "max_confidence_category": null,
  "final_verdict": "Not Harmful Content",
  "safe_content": true
}

Usage

To load the model into Ollama, make sure Ollama is installed and run:

ollama create text_moderation -f <path_to_model_file>

To use the model for moderation:

ollama run text_moderation "Your text here"

License

Harish Kumar S (Email: harishkumar56278@gmail.com, Site: harish-nika.github.io)