174 1 week ago

tools thinking
ollama run MrScratchcat22/GLM-4.7-Flash-REAP-23B-A3B

Details

fac1e5dddd39 · 14GB · deepseek2 · 23B · Q4_K_M
MIT License Copyright (c) [year] [fullname] Permission is hereby granted, free of charge, to any per
{ "stop": [ "<|user|>" ], "temperature": 1 }
{{ .Prompt }}
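
As a rough sketch of how the defaults above might be passed programmatically, the following builds a request for Ollama's local REST API (`/api/chat`). The model tag, stop token, and temperature are taken from this page. The helper names `build_chat_request` and `send_chat` are hypothetical, and the endpoint assumes an Ollama server running on its default local port with the model already pulled.

```python
import json
import urllib.request

MODEL = "MrScratchcat22/GLM-4.7-Flash-REAP-23B-A3B"
OLLAMA_URL = "http://localhost:11434/api/chat"  # assumption: default local endpoint


def build_chat_request(prompt: str) -> dict:
    """Build a non-streaming chat request that mirrors the page's parameters."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Same values as the parameter block shown on this page.
        "options": {"stop": ["<|user|>"], "temperature": 1},
    }


def send_chat(prompt: str, url: str = OLLAMA_URL) -> str:
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Calling `send_chat("Write a function that reverses a string.")` should return the reply as a string, provided the server is reachable. Building the payload separately from sending it keeps the request easy to inspect without a running server.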

Readme

Introducing GLM-4.7-Flash-REAP-23B-A3B, a memory-efficient compressed variant of GLM-4.7-Flash that maintains near-identical performance while being 25% lighter.

This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router’s independent control over remaining experts. Key features include:

  • Near-Lossless Performance: Maintains almost identical accuracy on code generation, agentic coding, and function calling tasks compared to the full GLM-4.7-Flash model
  • 25% Memory Reduction: Compressed to 23B parameters, roughly 25% fewer than the full model, significantly lowering deployment costs and memory requirements
  • Preserved Capabilities: Retains all core functionalities including code generation, agentic workflows, repository-scale understanding, and function calling
  • Drop-in Compatibility: Works with vanilla vLLM - no source modifications or custom patches required
  • Optimized for Real-World Use: Particularly effective for resource-constrained environments, local deployments, and academic research
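
The pruning idea described above can be illustrated with a toy sketch. This is not the REAP implementation. It assumes that an expert's saliency is the mean, over the tokens routed to it, of the router gate weight multiplied by the L2 norm of the expert's output, and that the lowest-scoring experts are removed while the remaining experts and their router weights are left untouched. The function names and input format are hypothetical.

```python
import math
from collections import defaultdict


def reap_saliency(records):
    """Score each expert from (expert_id, gate_weight, output_vector) records.

    Each record is one (token, active expert) pair. The score is the mean of
    gate_weight * ||output||_2 over the tokens routed to that expert.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate, output in records:
        norm = math.sqrt(sum(v * v for v in output))
        totals[expert_id] += gate * norm
        counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}


def experts_to_keep(saliency, prune_fraction):
    """Return the sorted ids of experts that survive pruning."""
    n_prune = int(len(saliency) * prune_fraction)
    ranked = sorted(saliency, key=saliency.get)  # lowest saliency first
    return sorted(ranked[n_prune:])
```

With four experts and `prune_fraction=0.25`, exactly one expert (the one with the lowest score) is dropped. Experts are removed outright rather than merged, so under this sketch the router's gate values for the surviving experts are unchanged.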