
Upload a Sentence and Check if it has Harmful Content

Run the 7B model with:

ollama run harishkumar56278/TextContentModerator


Cybersecurity Text Moderation Model

This repository contains a Cybersecurity Content Moderation AI model for text-based content analysis. The model detects and classifies harmful or inappropriate text and runs through Ollama.

Model Information

  • Base Model: wizardlm2
  • Task: Text moderation and classification
  • Integration: Designed for use with Ollama

Features

  • Classifies text into predefined categories of harmful content.
  • Provides confidence scores (0 to 1) for each classification.
  • Generates structured JSON output for seamless integration.
  • Ensures strict system behavior to focus only on moderation.

Moderation Categories

The model classifies text into the following categories:

  1. Hate Speech – Offensive, derogatory, or discriminatory language targeting race, religion, gender, ethnicity, disability, or other protected traits.
  2. Unparliamentary Language – Profanity, offensive slurs, or disrespectful speech violating acceptable decorum.
  3. Threats – Statements implying harm, violence, doxxing, or any form of intimidation.
  4. Suicidal Content – Mentions of self-harm, suicidal ideation, or encouragement of self-harm.
  5. Terrorism-Related Content – Support, promotion, planning, or justification of terrorist acts or extremist ideologies.
  6. Illegal Content – Discussions of unlawful activities such as fraud, identity theft, hacking, drug trafficking, or other crimes.
  7. Harassment – Cyberbullying, repeated targeting, intimidation, or abusive behavior towards individuals or groups.
  8. Misinformation – False, misleading, or manipulated content designed to deceive or mislead the public.
  9. Self-Harm Encouragement – Any content that promotes, glorifies, or normalizes self-harm or suicidal behavior.
  10. Sexual Exploitation & Child Safety Violations – Content that depicts, promotes, or facilitates child exploitation, non-consensual sexual acts, or abuse.
  11. Explicit & NSFW Content – Pornographic, sexual, or highly explicit material unsuitable for general audiences.
  12. Political Manipulation & Disinformation – Coordinated or deceptive attempts to influence public opinion, elections, or spread propaganda.
  13. Spam, Scams, & Fraud – Deceptive content intended for financial gain, including phishing, Ponzi schemes, and fraudulent offers.

If no harmful content is detected, the text is classified as Safe.
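As a sketch, the taxonomy above can be mirrored in code when post-processing the model's JSON output. The snake_case keys below are assumptions inferred from the example outputs later in this README (only `hate_speech` and `threats` appear there verbatim):

```python
# Hypothetical snake_case keys for the 13 moderation categories; only
# "hate_speech" and "threats" are confirmed by the example outputs in
# this README -- the rest are assumed spellings.
CATEGORIES = [
    "hate_speech",
    "unparliamentary_language",
    "threats",
    "suicidal_content",
    "terrorism_related_content",
    "illegal_content",
    "harassment",
    "misinformation",
    "self_harm_encouragement",
    "sexual_exploitation_child_safety",
    "explicit_nsfw_content",
    "political_manipulation_disinformation",
    "spam_scams_fraud",
]

def is_known_category(key: str) -> bool:
    """Return True if a classification key is one of the 13 categories."""
    return key in CATEGORIES
```

A consumer can use this list to reject unexpected keys before trusting the classification.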

Model Configuration

FROM wizardlm2

# Set model parameters
PARAMETER temperature 0.2      
PARAMETER num_ctx 4096         
PARAMETER top_p 0.8            
PARAMETER repeat_penalty 1.2   

# Adapter

The adapter is fine-tuned on custom data in GPT conversation format, for example:
{
  "conversations": [
    {
      "role": "system",
      "content": "You are a content moderation assistant. Detect and classify harmful messages."
    },
    {
      "role": "user",
      "content": "Moderate and classify this message:\n\n\"' Waaaaittt niggah holllll'up ✋ these niggahs come to town start wife'n my old sluts 𐐤𐐤𐐤𐐪𐐪 .\""
    },
    {
      "role": "assistant",
      "content": "⚠️ Offensive Language"
    }
  ]
}
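A record in this format can be sanity-checked before fine-tuning. A minimal sketch, using only the field names shown in the example above:

```python
import json

def validate_record(raw: str) -> list:
    """Parse one training record and return the ordered list of roles.

    Raises ValueError if the record deviates from the
    system / user / assistant layout shown above.
    """
    record = json.loads(raw)
    roles = [turn["role"] for turn in record["conversations"]]
    if roles != ["system", "user", "assistant"]:
        raise ValueError(f"unexpected role sequence: {roles}")
    return roles

# Abbreviated example record in the same shape as the one above.
raw = '''{"conversations": [
    {"role": "system", "content": "You are a content moderation assistant."},
    {"role": "user", "content": "Moderate and classify this message: ..."},
    {"role": "assistant", "content": "Offensive Language"}
]}'''
```

Running `validate_record` over the whole dataset catches malformed rows before they silently degrade the fine-tune.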

System Behavior

The model is configured to strictly analyze and classify text. It does not engage in discussions, explanations, or opinions.

  • If a user submits text, it is classified into one or more categories.
  • If no category applies, it is labeled as “Safe” with a “Not Harmful Content” verdict.
  • If the input format is incorrect, the model returns:
    
    {
    "error": "Invalid format. Provide content in quotes: \"Your text here\"."
    }
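The quoting rule above can also be enforced client-side before sending text to the model. A hedged sketch (the error payload mirrors the one shown above; the helper name is an assumption):

```python
import json
import re

def check_input_format(text: str):
    """Return None if text is wrapped in double quotes, else the error JSON."""
    if re.fullmatch(r'"[^"]+"', text.strip()):
        return None
    return json.dumps(
        {"error": 'Invalid format. Provide content in quotes: "Your text here".'}
    )
```

Validating locally avoids a round trip to the model for trivially malformed input.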
    

Example Outputs

Harmful Content

{
  "classification": {
    "hate_speech": {
      "confidence_score": 0.85,
      "justification": "Detected racial slurs targeting a community - 'nigga'"
    },
    "threats": {
      "confidence_score": 0.92,
      "justification": "Direct threat of violence detected - explaining why it was flagged."
    }
  },
  "max_confidence_category": "threats",
  "final_verdict": "Harmful Content",
  "safe_content": false
}
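When consuming this JSON programmatically, `max_confidence_category` can be recomputed from the `classification` map as a consistency check. A sketch, using only fields present in the example outputs:

```python
def max_confidence_category(response: dict):
    """Return the category with the highest confidence, or None if safe."""
    classification = response.get("classification", {})
    if not classification:
        return None
    return max(classification, key=lambda k: classification[k]["confidence_score"])

# The harmful-content example from above, reduced to the relevant fields.
harmful = {
    "classification": {
        "hate_speech": {"confidence_score": 0.85},
        "threats": {"confidence_score": 0.92},
    },
    "max_confidence_category": "threats",
    "safe_content": False,
}
```

If the recomputed category disagrees with the model's own `max_confidence_category`, the response is worth flagging for review.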

Safe Content

{
  "classification": {},
  "max_confidence_category": null,
  "final_verdict": "Not Harmful Content",
  "safe_content": true
}

Usage

To load the model into Ollama, make sure Ollama is installed and run:

ollama create text_moderation -f <path_to_model_file>

To use the model for moderation:

ollama run text_moderation "Your text here"

License

Harish Kumar S (Email: harishkumar56278@gmail.com, Site: harish-nika.github.io)