103 Downloads Updated 1 week ago
ollama run rafw007/glm-4.7-flash-opencode
ollama launch claude --model rafw007/glm-4.7-flash-opencode
ollama launch codex-app --model rafw007/glm-4.7-flash-opencode
ollama launch openclaw --model rafw007/glm-4.7-flash-opencode
ollama launch hermes --model rafw007/glm-4.7-flash-opencode
ollama launch codex --model rafw007/glm-4.7-flash-opencode
ollama launch opencode --model rafw007/glm-4.7-flash-opencode
A family of custom models built on GLM-4.7-Flash (MoE, 30B total / 3B active), tuned to act as autonomous coding agents — each variant targeting a specific harness (opencode, codex, and Claude Code soon). The models speak native tool-calling, so they fully run agentic coding tools locally — your code never leaves your machine, and your cloud token cost drops to zero.
In thinking mode the base GLM-4.7-Flash is very slow on Apple Silicon (several minutes for a simple reply). This family separates reasoning from output: thinking goes into a dedicated thinking field, while content carries only the clean answer and tool calls. The result: the model responds immediately instead of monologuing, and the output is clean for the harness parser.
| Model | Base | Context | Purpose |
|---|---|---|---|
glm-4.7-flash-opencode |
GLM-4.7-Flash (MoE 30B / 3B active) | 64K | Published. Coding agent tuned for opencode (also works in codex). Clean content, no hallucinations, real tool-calling. |
glm-4.7-flash-claude-code |
GLM-4.7-Flash | 64K | In progress. Variant for Claude Code (CC overrides thinking control — needs a dedicated renderer/template). |
ollama run rafw007/glm-4.7-flash-opencode
In opencode:
ollama launch opencode rafw007/glm-4.7-flash-opencode
/nothink together with our SYSTEM prompt moves reasoning into a separate field — no monologue leaks into content.df -h, du, nmap, system_profiler instead of Linux-only commands.| Task | Verdict |
|---|---|
du / df |
Read the real disk output, no fabrication. |
nmap |
Handled permission limits and returned 22 real hosts. |
| Tetris | Full, working implementation — 396 lines (score, levels, next-piece preview, controls, game-over screen). |
Performance (measured, same model, 64K context, 100% GPU, ~25 GB in memory):
| Hardware | Generation | Prompt eval |
|---|---|---|
| Mac Studio M2 (32 GB) | ~46 tok/s | ~494 tok/s |
| Mac Mini M4 (32 GB) | ~25 tok/s | ~250 tok/s |
The Studio is nearly 2× faster at identical quality — the difference comes from memory bandwidth, not the model.
-claude-code variant is in the works.</tool_call> tag is sometimes printed as text (parser mismatch on the harness side).Designed, built, and tested with Claude Opus — the idea: the world’s best coding model builds smaller models in its own image that take over the work right on your desk. The system prompts, parameter choices, and context configuration come directly from that work.
MIT (inherited from the base GLM-4.7-Flash).
© 2026 rafw007