MrScratchcat22/GLM-4.7-Flash-REAP-23B-A3B

Introducing GLM-4.7-Flash-REAP-23B-A3B, a memory-efficient compressed variant of GLM-4.7-Flash that maintains near-identical performance while being 25% lighter.

This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router’s independent control over remaining experts. Key features include:

Near-Lossless Performance: Maintains almost identical accuracy on code generation, agentic coding, and function calling tasks compared to the unpruned GLM-4.7-Flash model
25% Memory Reduction: Compressed to roughly 23B total parameters (about 3B active per token, per the model name), significantly lowering deployment costs and memory requirements
Preserved Capabilities: Retains all core functionalities including code generation, agentic workflows, repository-scale understanding, and function calling
Drop-in Compatibility: Works with vanilla vLLM; no source modifications or custom patches are required
Optimized for Real-World Use: Particularly effective for resource-constrained environments, local deployments, and academic research
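
To make the pruning idea concrete, here is a minimal sketch of a router-weighted saliency score for a single MoE layer. It assumes that an expert's importance is the average, over the tokens routed to it, of the router gate weight multiplied by the L2 norm of the expert's output, and that the lowest-scoring experts are dropped. The function names (`reap_saliency`, `experts_to_keep`) and the toy data are hypothetical and are not taken from the actual REAP code; the real pipeline runs on calibration data across every layer of the model.

```python
import math


def reap_saliency(gate_weights, expert_outputs, top_k):
    """Score each expert in one MoE layer (illustrative sketch only).

    gate_weights[t][j]   : router weight of expert j for token t
    expert_outputs[t][j] : output vector of expert j for token t
    Returns, per expert, the mean of gate * ||output||_2 over the tokens
    that were actually routed to that expert (top-k routing assumed).
    """
    num_experts = len(gate_weights[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for gates, outputs in zip(gate_weights, expert_outputs):
        # Experts selected by the router for this token.
        routed = sorted(range(num_experts), key=lambda j: gates[j], reverse=True)[:top_k]
        for j in routed:
            norm = math.sqrt(sum(v * v for v in outputs[j]))
            totals[j] += gates[j] * norm
            counts[j] += 1
    # Experts that never received a token get a saliency of 0.
    return [totals[j] / counts[j] if counts[j] else 0.0 for j in range(num_experts)]


def experts_to_keep(saliency, prune_fraction):
    """Return the indices of experts that survive pruning, in ascending order."""
    num_prune = int(len(saliency) * prune_fraction)
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j])
    return sorted(ranked[num_prune:])


# Toy example: 2 tokens, 4 experts, top-2 routing, prune 25% of experts.
gate_weights = [
    [0.6, 0.3, 0.05, 0.05],
    [0.5, 0.1, 0.35, 0.05],
]
expert_outputs = [
    [[3.0, 4.0], [1.0, 0.0], [0.0, 0.0], [10.0, 0.0]],
    [[0.0, 1.0], [2.0, 0.0], [0.0, 2.0], [10.0, 0.0]],
]
saliency = reap_saliency(gate_weights, expert_outputs, top_k=2)
kept = experts_to_keep(saliency, prune_fraction=0.25)
```

In this toy layer the fourth expert produces large outputs but is never selected by the router, so it scores lowest and is the one removed. The surviving experts and their router weights are left unchanged, which is the sense in which the router keeps independent control over the remaining experts.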
