13 6 hours ago

Validated runtime configuration + measured results for an abliterated (uncensored) Qwen3-Coder-Next 80B-A3B running as an agentic coding backend on CPU via ik_llama.cpp

ollama run rafw007/qwen3-coder-next-80b-redteam

Models

View all →

Readme

Qwen3-Coder-Next-80B-A3B - Abliterated - Red-Team Coding Agent

Validated runtime configuration and measured results for an abliterated (uncensored) Qwen3-Coder-Next 80B-A3B running as an agentic coding backend on CPU via ik_llama.cpp. This model card documents our tested configuration and findings. Its value lies in presenting a measured, working setup and an honest account of what it can and cannot do.

SAFETY This model may produce sensitive, explicit, misleading, biased, or otherwise inappropriate content. Safety filtering is limited, and outputs should not be assumed to be safe by default.

INTENDED USE This model is intended for research, evaluation, red-teaming, and controlled use with human oversight. It is best suited for advanced users who can implement their own safeguards and review processes.

OUT-OF-SCOPE USE This model is not recommended for public-facing deployment without additional moderation. It should not be used in applications involving minors, regulated environments, high-risk domains, or automated decision-making.

RISKS AND RECOMMENDATIONS Outputs may be harmful, unlawful, privacy-invasive, or factually unreliable. Human review, logging, access controls, monitoring, and additional filtering are strongly recommended before any production or external use.

WHAT THIS IS Base: Qwen3-Coder-Next (qwen3next architecture, hybrid attention: linear/Gated-DeltaNet plus periodic full-attention layers). MoE: 80B total / ~3B active (A3B). Variant: abliterated / uncensored - standard refusal guardrails have been removed from the weights. Quantization: Q4_K_M GGUF (~48.5 GB weights; CPU buffer ~46.3 GiB with all expert tensors offloaded to CPU). Native context: 262144 (n_ctx_train). Served here: 65536 (64K) - chosen to preserve RAM headroom; increase it if your hardware allows. Role: agentic coding backend with tool calling for OpenAI-compatible and Anthropic-compatible coding clients.

VALIDATED RUNTIME CONFIGURATION ik_llama.cpp, CPU-only: llama-server
–model qwen3-coder-next-abliterated-Q4_K_M.gguf
–alias qwen3-coder-redteam
–ctx-size 65536
–cache-type-k q8_0 –cache-type-v q4_0 -fa on
–threads 14 –threads-batch 12 –n-cpu-moe 94
–mlock –no-mmap –batch-size 2048 –ubatch-size 512
–temp 0.1 –top-k 40 –top-p 0.9 –min-p 0.01 –repeat-penalty 1.0
–jinja –host 0.0.0.0 –port 8080 Runtime: ik_llama.cpp (the ikawrakow fork), chosen over vanilla llama.cpp for materially better CPU and MoE throughput. The KV cache is quantized (k=q8_0, v=q4_0) to keep a 64K context affordable on tight RAM budgets.

HARDWARE TESTED Mini PC, x86_64, 18 cores, 62 GB RAM, no GPU (Ubuntu 24.04). Pure CPU inference (–n-cpu-moe 94).

MEASURED PERFORMANCE Real-world results, not vendor claims: • opencode session, ik, 64K, 6607-token prompt: prompt processing 62.3 t/s, generation 14.4 t/s; real agentic session. • Small context, low n_past: about 16 t/s decode; throughput falls as the context window fills. • Cold load (–mlock, 48.5 GB): about 40 seconds before the port binds.

BEHAVIORAL FINDINGS • Tool calling works. • Real agentic runs under opencode (MCP tools, multi-file coding) succeeded; for example, the model generated a complete, working ~400-line HTML landing page. • message.tool_calls was clean, with no round-trip loop. • Code output is clean at Q4; no token corruption was observed in generated HTML, CSS, or JavaScript. Language drift appears to be a sampling artifact rather than a defect in the weights. During long mixed English/Polish generations at loose sampling (temp = 1.0, top_p = 1.0), the model may leak Cyrillic homoglyphs (for example, Cyrillic с and о inside Polish words) and English calques. At temp = 0.1 and top_p = 0.9, this issue disappears: repeated Polish generations scanned clean, with zero homoglyphs and correct diacritics. Recommendation: keep temp <= 0.4 for non-English prose, and lint outputs for homoglyphs if that matters.

CLIENT INTEGRATION opencode:
Point it directly at the OpenAI-compatible endpoint: http://HOST:8080/v1, using the model name qwen3-coder-redteam. This works directly. OpenAI Codex CLI:
Recent Codex versions dropped wire_api = “chat” and now require wire_api = “responses”. llama.cpp / ik_llama.cpp does not implement /v1/responses, so Codex needs a thin Responses-to-Chat proxy in front of it that strips unsupported fields and forwards requests to /v1/chat/completions. With that proxy, Codex works end-to-end. Claude Code:
This requires an Anthropic API shim. A minimal non-streaming shim can provide basic chat functionality, but it is not recommended for serious agentic use because it lacks streaming and proper tool-use translation. This setup was not validated here.

CREDITS AND LICENSE Base: built on top of bartowski’s qwen3-coder-next-abliterated-Q4_K_M.gguf, which is itself based on Qwen3-Coder-Next by Alibaba/Qwen. Usage is subject to the Qwen license. The user is responsible for compliance with that license, local law, and export regulations. How this model came to be:
This model was designed, built, and tested with the help of Claude Opus. The idea was simple: the world’s best coding model should be able to create smaller models in its own image. Its system prompts, parameters, and context configuration come directly from that work - the world’s best coding model preparing local models that can take over right on your desk.