
Capabilities: tools, thinking
ollama run MrScratchcat22/GLM-4.7-Flash-REAP-23B-A3B


Readme

Introducing GLM-4.7-Flash-REAP-23B-A3B, a memory-efficient compressed variant of GLM-4.7-Flash that maintains near-identical performance while being 25% lighter.

This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router’s independent control over the remaining experts; a rough sketch of the scoring idea appears after the feature list. Key features include:

  • Near-Lossless Performance: Maintains almost identical accuracy on code generation, agentic coding, and function calling tasks compared to the full GLM-4.7-Flash model
  • 25% Memory Reduction: Expert pruning trims roughly a quarter of the total parameters, leaving 23B total (3B active) and significantly lowering deployment costs and memory requirements
  • Preserved Capabilities: Retains all core functionalities including code generation, agentic workflows, repository-scale understanding, and function calling
  • Drop-in Compatibility: Works with vanilla vLLM, with no source modifications or custom patches required (see the serving example after this list)
  • Optimized for Real-World Use: Particularly effective for resource-constrained environments, local deployments, and academic research
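
For intuition, here is a minimal sketch of the router-weighted saliency idea behind REAP: each expert is scored by how strongly the router weights it and how large its output activations are on a calibration set, and the lowest-scoring experts are pruned while the router columns for the kept experts are left untouched. The function names, tensor shapes, and keep_ratio below are illustrative assumptions, not the authors' implementation.

    import torch

    def expert_saliency(router_probs, expert_out_norms):
        # router_probs:     [num_tokens, num_experts] softmax gate weights
        # expert_out_norms: [num_tokens, num_experts] L2 norm of each expert's
        #                   output per token (zero where the expert is not routed)
        # Score each expert by its router-weighted activation magnitude,
        # averaged over the calibration tokens.
        return (router_probs * expert_out_norms).mean(dim=0)

    def experts_to_keep(saliency, keep_ratio=0.75):
        # Keep the top-scoring experts (e.g. 75% of them for a ~25% reduction);
        # the pruned experts' weights and router columns are simply dropped.
        num_keep = max(1, int(saliency.numel() * keep_ratio))
        return torch.topk(saliency, num_keep).indices.sort().values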
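
Because the checkpoint is drop-in compatible with vanilla vLLM, serving it should look like serving any other model. The Hugging Face repository id below is a placeholder assumption, as is the sampling configuration.

    from vllm import LLM, SamplingParams

    # Placeholder repository id; substitute the actual published checkpoint path.
    llm = LLM(model="your-org/GLM-4.7-Flash-REAP-23B-A3B")
    params = SamplingParams(temperature=0.7, max_tokens=256)

    outputs = llm.generate(["Write a Python function that reverses a string."], params)
    print(outputs[0].outputs[0].text)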