ollama run glm-5:cloud
GLM-5 is a mixture-of-experts model from Z.ai with 744B total parameters and 40B active parameters. It scales up from GLM-4.5’s 355B parameters and is designed for complex reasoning, coding, and agentic tasks.
The model uses DeepSeek Sparse Attention (DSA) to reduce deployment costs while preserving long-context capacity, and was post-trained using a novel asynchronous RL infrastructure for improved training efficiency.
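For programmatic use, here is a minimal sketch of querying the model through a local Ollama server with the official `ollama` Python client (`pip install ollama`). The model tag matches the `ollama run` command above; the prompt is illustrative only.

```python
# Minimal sketch: chat with glm-5:cloud via a running Ollama server,
# using the official `ollama` Python client.
import ollama

response = ollama.chat(
    model="glm-5:cloud",  # same tag as the `ollama run` command above
    messages=[
        {"role": "user", "content": "Explain mixture-of-experts in two sentences."}
    ],
)
print(response["message"]["content"])
```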
Evaluation settings vary by benchmark:

- HLE: temperature=1.0, top_p=0.95, max_new_tokens=131072. By default, we report the text-only subset; results marked with * are from the full set. We use GPT-5.2 (medium) as the judge model. For HLE-with-tools, we use a maximum context length of 202,752 tokens.
- temperature=0.7, top_p=0.95, max_new_tokens=16384, with a 200K context window.
- timeout=2h, temperature=0.7, top_p=1.0, max_new_tokens=8192, with a 128K context window. Resource limits are capped at 16 CPUs and 32 GB RAM.
- Terminal-Bench 2.0: temperature=1.0, top_p=0.95, max_new_tokens=65536. We remove wall-clock time limits due to generation speed, while preserving per-task CPU and memory constraints. Scores are averaged over 5 runs. We fix environment issues introduced by Claude Code and also report results on a verified Terminal-Bench 2.0 dataset that resolves ambiguous instructions (see: https://huggingface.co/datasets/zai-org/terminal-bench-2-verified).
- temperature=1.0, top_p=1.0, max_new_tokens=32000, with a 250-minute timeout per task. Results are single-run Pass@1 over 1,507 tasks.
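As a rough illustration, the sampling parameters above map onto standard Ollama request options (temperature, top_p, num_predict, num_ctx). The sketch below uses the HLE-style values from the first bullet; whether these options exactly reproduce Z.ai's evaluation harness is an assumption.

```python
# Sketch: applying one of the evaluation sampling configurations
# (the HLE-style settings above) as Ollama request options.
# Option names are standard Ollama parameters; matching the official
# benchmark harness exactly is NOT guaranteed.
import ollama

options = {
    "temperature": 1.0,     # sampling temperature from the HLE setting
    "top_p": 0.95,          # nucleus sampling threshold
    "num_predict": 131072,  # Ollama's equivalent of max_new_tokens
    "num_ctx": 202752,      # context length reported for HLE-with-tools
}

response = ollama.generate(
    model="glm-5:cloud",
    prompt="...",  # a benchmark task prompt would go here
    options=options,
)
print(response["response"])
```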