```shell
ollama run hherb/qwen3.5-9b-biasbuster5-q8_0
```
A fine-tuned Qwen3.5-9B model that detects bias in clinical trial abstracts across five evidence-based domains. Unlike general-purpose classifiers, BiasBuster produces step-by-step reasoning chains and recommends specific verification databases where each claim can be checked.
Given a clinical trial abstract, BiasBuster reasons step-by-step inside <think> tags and then scores five bias domains:

| Domain | What It Catches |
|---|---|
| Statistical Reporting | Sole reliance on relative measures (RRR, OR, HR) without absolute measures (ARR, NNT, baseline risk) |
| Spin | Conclusions that overstate or misrepresent actual results (Boutron taxonomy) |
| Outcome Reporting | Surrogate endpoints, composite endpoints, outcome switching from registered protocol |
| Conflict of Interest | Industry funding without disclosure, undisclosed author affiliations with sponsors |
| Methodological Red Flags | Inappropriate comparators, enrichment designs, per-protocol without ITT, premature stopping, short follow-up |
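To illustrate why the Statistical Reporting domain insists on absolute measures: the same relative effect implies very different absolute benefits depending on baseline risk. The sketch below uses hypothetical numbers (not from any trial discussed here) and treats a relative risk of 0.72 as the effect size for simplicity; a hazard ratio is only approximately a relative risk.

```python
def absolute_measures(control_risk: float, relative_risk: float):
    """Derive ARR and NNT from a baseline (control) event rate and a relative risk."""
    treatment_risk = control_risk * relative_risk
    arr = control_risk - treatment_risk  # absolute risk reduction
    nnt = 1 / arr                        # number needed to treat
    return round(arr, 3), round(nnt, 1)

# Same relative effect (0.72), very different clinical impact:
print(absolute_measures(0.20, 0.72))  # high-risk population: (0.056, 17.9)
print(absolute_measures(0.01, 0.72))  # low-risk population: ARR 0.003, NNT ~357
```

Without the baseline event rate, a reader cannot distinguish these two scenarios, which is exactly the gap the model flags.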
```shell
ollama run hherb/qwen3.5-9b-biasbuster5
```
Then paste a clinical trial abstract. The model will produce a <think> reasoning block followed by a structured JSON assessment.
```
Assess the following abstract for bias:

Background: We conducted a randomized, double-blind trial of Drug X versus placebo
in 1,200 patients with moderate-to-severe heart failure. The primary endpoint was
the composite of cardiovascular death or hospitalization for heart failure.

Results: Drug X reduced the primary composite endpoint (HR 0.72, 95% CI 0.59-0.88,
p=0.001). The study was funded by PharmaCorp Inc.

Conclusions: Drug X significantly reduces cardiovascular events and should be
considered for routine clinical use in heart failure management.
```
```
<think>
Statistical reporting: The abstract reports HR 0.72 (95% CI 0.59-0.88) — a relative
measure only. No absolute risk reduction, NNT, or baseline event rates provided.
The reader cannot assess clinical significance without external data...

Spin: The conclusion recommends "routine clinical use" based on a single trial with
only a composite endpoint. The language overstates the evidence...

Conflict of interest: Funded by PharmaCorp Inc. No author-level COI disclosure.
No information about author affiliations with the sponsor...
</think>
{
  "statistical_reporting": {
    "severity": "moderate",
    "flags": {"relative_only": true, "absolute_reported": false, "nnt_reported": false},
    "evidence": "HR 0.72 (95% CI 0.59-0.88, p=0.001) with no absolute measures",
    "reasoning": "Only relative risk reduction reported; baseline event rate absent"
  },
  "spin": {
    "severity": "high",
    "flags": {"conclusion_matches_results": false},
    "evidence": "Concludes 'should be considered for routine clinical use'",
    "reasoning": "Recommends clinical adoption from a single trial without acknowledging limitations"
  },
  ...
  "verification_steps": [
    "Search ClinicalTrials.gov for the trial registration to compare registered vs published endpoints",
    "Check CMS Open Payments (openpaymentsdata.cms.gov) for author payment records from PharmaCorp",
    "Search ORCID for author affiliation histories with PharmaCorp",
    "Access full text via Europe PMC to review funding and COI disclosure sections"
  ]
}
```
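Since the completion always contains a <think> block followed by the JSON assessment, a downstream consumer can separate the two with a few lines of Python. This is a minimal sketch (the helper name is mine, not from the biasbuster repo, which may parse differently):

```python
import json
import re

def parse_biasbuster_output(completion: str):
    """Split a BiasBuster completion into its reasoning chain and JSON assessment."""
    match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    # Everything after </think> should be the structured assessment
    json_part = completion.split("</think>")[-1].strip()
    return reasoning, json.loads(json_part)

reasoning, assessment = parse_biasbuster_output(
    '<think>Relative measure only...</think>\n'
    '{"statistical_reporting": {"severity": "moderate"}}'
)
print(assessment["statistical_reporting"]["severity"])  # moderate
```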
Evaluated on 144 held-out test abstracts (80/10/10 train/validation/test split).
Overall binary bias detection:

| Metric | Score |
|---|---|
| F1 | 0.924 |
| Recall | 0.950 |
| Precision | 0.898 |
| Accuracy | 0.868 |
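As a quick arithmetic sanity check (not part of the evaluation code), the reported F1 is consistent with the reported precision and recall to within rounding:

```python
precision, recall = 0.898, 0.950
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.923, matching the reported 0.924 to rounding
```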
Per-domain performance:

| Domain | F1 | Precision | Recall |
|---|---|---|---|
| Outcome Reporting | 0.839 | 0.771 | 0.922 |
| Spin (Boutron) | 0.826 | 0.750 | 0.918 |
| Statistical Reporting | 0.806 | 0.713 | 0.926 |
| Methodology | 0.737 | 0.658 | 0.839 |
| Conflict of Interest | 0.698 | 0.677 | 0.720 |
Severity calibration:

| Metric | Score |
|---|---|
| Within-One Accuracy | 76.4% |
| Cohen’s Kappa (weighted) | 0.124 |
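Within-one accuracy here presumably means the predicted severity lands at most one ordinal level from the gold label. A sketch of that metric under an assumed four-level scale (none/low/moderate/high is my assumption, not the repo's documented scale):

```python
SEVERITY = {"none": 0, "low": 1, "moderate": 2, "high": 3}

def within_one_accuracy(predicted, gold):
    """Fraction of predictions within one ordinal severity level of the gold label."""
    hits = sum(abs(SEVERITY[p] - SEVERITY[g]) <= 1 for p, g in zip(predicted, gold))
    return hits / len(gold)

print(within_one_accuracy(
    ["high", "moderate", "none", "low"],
    ["high", "high", "moderate", "low"],
))  # 0.75 — only the none-vs-moderate prediction is more than one level off
```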
Reasoning-chain and output quality:

| Metric | Score |
|---|---|
| Thinking chains present | 100% |
| Mean reasoning length | 1,289 characters |
| Parse failures | 0% |
| Verification source score | 0.495 |
Comparison against the zero-shot base model:

| Configuration | Binary F1 | Recall | Spin F1 | Verification |
|---|---|---|---|---|
| BiasBuster (9B fine-tuned) | 0.924 | 0.950 | 0.826 | 0.495 |
| Qwen3.5-9B zero-shot (v3 prompt) | 0.895 | 0.904 | 0.321 | 0.464 |
Fine-tuning dramatically improves spin detection (F1 0.321 -> 0.826) and produces structured reasoning chains (0% -> 100% thinking present).
The model recommends checks against specific resources: ClinicalTrials.gov, CMS Open Payments, ORCID, and Europe PMC (see the verification_steps example above). Training configuration:
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-9B |
| Method | LoRA (rank 32, alpha 64, dropout 0.08) |
| Training data | 1,235 curated clinical trial abstracts |
| Data sources | Retracted papers (Crossref), Cochrane RoB (Europe PMC), PubMed RCTs, ClinicalTrials.gov |
| Annotation | Claude + human review |
| Hardware | NVIDIA DGX Spark (GB10 Blackwell, 128 GB unified memory) |
| Training time | ~2.5 hours (927 steps, 3 epochs) |
| Precision | bfloat16 |
| Framework | TRL (SFTTrainer) + PEFT |
| Learning rate | 2e-4 (cosine with 10% warmup) |
| Max sequence length | 4,096 tokens |
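The hyperparameters in the table map onto a PEFT/TRL setup roughly like the following. This is a configuration sketch reconstructed from the table, not the repo's actual training script; in particular, the target modules and output directory are assumptions:

```python
from peft import LoraConfig
from trl import SFTConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.08,
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,          # 10% warmup
    num_train_epochs=3,
    bf16=True,
    max_length=4096,           # recent TRL; older releases call this max_seq_length
    output_dir="biasbuster-lora",  # assumed path
)
```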
The BiasBuster repository includes a verification agent that closes the loop between bias detection and evidence gathering. The model is trained to cite specific databases in its verification recommendations — the agent parses these tool-use recommendations from the model’s output and executes them against real databases automatically:
This works because the fine-tuned model reliably names the target databases in its output, enabling keyword-based routing without a secondary LLM call.
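A minimal sketch of that keyword routing (the route table and function names are illustrative; the repo's agent may structure this differently):

```python
# Map database names the model reliably emits to tool identifiers.
ROUTES = {
    "clinicaltrials.gov": "registry_lookup",
    "open payments": "payments_lookup",
    "orcid": "orcid_lookup",
    "europe pmc": "fulltext_lookup",
}

def route_verification_steps(steps):
    """Match each recommended verification step to a tool by keyword search."""
    plan = []
    for step in steps:
        lowered = step.lower()
        for keyword, tool in ROUTES.items():
            if keyword in lowered:
                plan.append((tool, step))
                break
    return plan

plan = route_verification_steps([
    "Search ClinicalTrials.gov for the trial registration",
    "Check CMS Open Payments for author payment records",
])
print([tool for tool, _ in plan])  # ['registry_lookup', 'payments_lookup']
```

Because the routing is a plain substring match, it adds no latency and no second model call; the trade-off is that it only works while the model keeps naming the databases verbatim.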
```shell
# Clone and set up
git clone https://github.com/hherb/biasbuster.git
cd biasbuster && uv sync

# Run the agent on a single abstract by PMID
uv run python -m agent --pmid 12345678

# Evaluate this model against a test set
uv run python -m evaluation.run \
    --test-set dataset/export/alpaca/test.jsonl \
    --model-a hherb/qwen3.5-9b-biasbuster5 \
    --endpoint-a http://localhost:11434 \
    --mode fine-tuned
```
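Outside the repo's CLI, the model served by Ollama can also be queried directly over its HTTP API. The sketch below builds a request against Ollama's standard /api/generate endpoint; actually sending it (with `urllib.request.urlopen`) is left to the caller so the example stays self-contained:

```python
import json
import urllib.request

def build_request(abstract: str, host: str = "http://localhost:11434"):
    """Build an Ollama /api/generate request asking for a bias assessment."""
    payload = {
        "model": "hherb/qwen3.5-9b-biasbuster5",
        "prompt": f"Assess the following abstract for bias:\n{abstract}",
        "stream": False,  # return one complete response instead of a token stream
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Background: We conducted a randomized trial...")
print(req.full_url)  # http://localhost:11434/api/generate
```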
```bibtex
@software{biasbuster2026,
  title  = {BiasBuster: Teaching Small Language Models to Detect Bias in Medical Research},
  author = {Herb, Horst},
  year   = {2026},
  url    = {https://github.com/hherb/biasbuster}
}
```
License: Apache 2.0