MrScratchcat22/GLM-4.7-Flash-REAP-23B-A3B

Details

Updated 1 week ago


fac1e5dddd39 · 14GB ·

model

arch deepseek2

parameters 23B

quantization Q4_K_M

14GB
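
As a rough sanity check on the figures above, the file size and parameter count imply an average number of bits stored per weight. The sketch below assumes "14GB" means 14 × 10^9 bytes and "23B" means 23 × 10^9 parameters; both are rounded on the page, so the result is only approximate.

```python
def bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Average bits per parameter, assuming decimal GB and billions of parameters."""
    total_bits = size_gb * 1e9 * 8
    total_params = params_billion * 1e9
    return total_bits / total_params


# Figures taken from the model details above (rounded values).
bpw = bits_per_weight(14, 23)
print(f"{bpw:.2f} bits per weight")  # prints "4.87 bits per weight"
```

A value a little under 5 bits per weight is in line with what a 4-bit K-quant such as Q4_K_M usually averages, since some tensors are kept at higher precision.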

license

MIT License Copyright (c) [year] [fullname] Permission is hereby granted, free of charge, to any per

1.1kB

params

{ "stop": [ "<|user|>" ], "temperature": 1 }

48B
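
These defaults can also be passed explicitly when calling the model through a local Ollama server. The following is a minimal sketch, assuming Ollama is running on its default port and the model has already been pulled under the name shown at the top of this page; `build_payload` and `generate` are illustrative helper names, not part of any library.

```python
import json
import urllib.request

MODEL = "MrScratchcat22/GLM-4.7-Flash-REAP-23B-A3B"
# Assumption: a local Ollama server listening on its default address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str) -> dict:
    """Build a non-streaming generate request that mirrors the params above."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 1, "stop": ["<|user|>"]},
    }


def generate(prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("Write a function that reverses a string.")` would return the completion as a string. The stop sequence `<|user|>` keeps the model from running on into a new user turn.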

template

13B

Introducing GLM-4.7-Flash-REAP-23B-A3B, a memory-efficient compressed variant of GLM-4.7-Flash that maintains near-identical performance while using about 25% less memory.

This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router’s independent control over remaining experts. Key features include:

Near-Lossless Performance: Maintains almost identical accuracy on code generation, agentic coding, and function calling tasks compared to the full GLM-4.7-Flash model
25% Memory Reduction: Compressed to 23B parameters, about 25% fewer than the full model, significantly lowering deployment costs and memory requirements
Preserved Capabilities: Retains all core functionalities including code generation, agentic workflows, repository-scale understanding, and function calling
Drop-in Compatibility: Works with vanilla vLLM - no source modifications or custom patches required
Optimized for Real-World Use: Particularly effective for resource-constrained environments, local deployments, and academic research
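
To make the idea of router-weighted expert pruning more concrete, here is a toy sketch. It is not the authors' implementation: the scoring rule (router gate weight times the norm of the expert's output, averaged over the tokens routed to that expert) is an assumption based on the method's name, and `reap_saliency` and `experts_to_keep` are hypothetical helpers. Experts with the lowest scores are dropped, and the rest are left untouched, so the router still chooses among the remaining experts independently.

```python
def reap_saliency(gates, out_norms):
    """Score each expert on calibration data.

    gates[t][e]:     router gate weight of expert e for token t (0 if not routed).
    out_norms[t][e]: L2 norm of expert e's output for token t.
    Returns one score per expert: mean of gate * norm over its routed tokens.
    """
    n_tokens, n_experts = len(gates), len(gates[0])
    scores = []
    for e in range(n_experts):
        routed = [t for t in range(n_tokens) if gates[t][e] > 0]
        if not routed:
            scores.append(0.0)  # never selected: least salient
            continue
        total = sum(gates[t][e] * out_norms[t][e] for t in routed)
        scores.append(total / len(routed))
    return scores


def experts_to_keep(scores, prune_fraction=0.25):
    """Drop the lowest-scoring fraction of experts; keep the rest in original order."""
    n_prune = int(len(scores) * prune_fraction)
    lowest_first = sorted(range(len(scores)), key=lambda e: scores[e])
    pruned = set(lowest_first[:n_prune])
    return [e for e in range(len(scores)) if e not in pruned]


# Toy calibration batch: 3 tokens, 4 experts, top-2 routing (made-up numbers).
gates = [[0.7, 0.3, 0.0, 0.0], [0.0, 0.6, 0.4, 0.0], [0.5, 0.0, 0.0, 0.5]]
norms = [[2.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.5, 0.0], [2.0, 0.0, 0.0, 3.0]]
print(experts_to_keep(reap_saliency(gates, norms)))  # prints [0, 1, 3]
```

In this toy batch, expert 2 is rarely routed to, with a small gate weight and a small output, so it is the one removed at a 25% pruning rate.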
