7,105 downloads · Updated 23 hours ago

A strong reasoning and agentic model from Z.ai with 744B total parameters (40B active), built for complex systems engineering and long-horizon tasks.

Capabilities: tools, thinking, cloud
ollama run glm-5:cloud
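
For programmatic access, the same model can be called through Ollama's local REST API. Below is a minimal sketch (Python, using requests) that assumes a running Ollama server on the default port; the think flag and the thinking field follow Ollama's convention for thinking-capable models and may differ across versions.

    # Minimal sketch: chatting with glm-5:cloud over Ollama's local REST API.
    # Assumes Ollama is running on the default port 11434.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "glm-5:cloud",
            "messages": [
                {"role": "user", "content": "Outline a migration plan from REST to gRPC."}
            ],
            "think": True,    # assumption: enables the thinking trace on thinking-capable models
            "stream": False,  # return one JSON object instead of a token stream
        },
        timeout=600,
    )
    resp.raise_for_status()
    message = resp.json()["message"]
    print(message.get("thinking", ""))  # reasoning trace, if the server returns one
    print(message["content"])           # final answer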

Applications

  • Claude Code: ollama launch claude --model glm-5:cloud
  • Codex: ollama launch codex --model glm-5:cloud
  • OpenCode: ollama launch opencode --model glm-5:cloud
  • OpenClaw: ollama launch openclaw --model glm-5:cloud
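
Applications not listed above can usually reach the same model through Ollama's OpenAI-compatible endpoint at http://localhost:11434/v1. The sketch below uses the openai Python client; whether any given launcher above routes through this endpoint is not specified here, and the api_key value is a placeholder that a local Ollama server does not check.

    # Sketch: reaching glm-5:cloud through Ollama's OpenAI-compatible API.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
        api_key="ollama",                      # required by the client, not checked locally
    )

    reply = client.chat.completions.create(
        model="glm-5:cloud",
        messages=[{"role": "user", "content": "Summarize the trade-offs of sparse attention."}],
    )
    print(reply.choices[0].message.content)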

Readme

GLM-5 is a mixture-of-experts model from Z.ai with 744B total parameters and 40B active parameters. It scales up from GLM-4.5’s 355B parameters and is designed for complex reasoning, coding, and agentic tasks.
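
For intuition only, the sketch below shows the generic top-k expert-routing pattern behind a "744B total, 40B active" split: each token is dispatched to a small number of experts, so only a fraction of the total parameters participates in any single forward pass. It is a toy illustration, not Z.ai's actual router.

    # Toy top-k mixture-of-experts routing (illustrative only, not GLM-5's router).
    import torch
    import torch.nn.functional as F

    def moe_forward(x, experts, gate, k=2):
        """x: [tokens, d]; experts: list of modules; gate: [d, num_experts] routing weights."""
        logits = x @ gate                        # router scores per expert
        topk_val, topk_idx = logits.topk(k, dim=-1)
        weights = F.softmax(topk_val, dim=-1)    # normalize over the k chosen experts only
        out = torch.zeros_like(x)
        for slot in range(k):
            for e, expert in enumerate(experts):
                mask = topk_idx[:, slot] == e    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

    experts = [torch.nn.Linear(64, 64) for _ in range(8)]
    gate = torch.randn(64, 8)
    tokens = torch.randn(16, 64)
    print(moe_forward(tokens, experts, gate).shape)  # torch.Size([16, 64])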

The model uses DeepSeek Sparse Attention (DSA) to reduce deployment costs while preserving long-context capacity, and was post-trained using a novel asynchronous RL infrastructure for improved training efficiency.
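
Sparse attention can be sketched in the same spirit: each query keeps only its top-k highest-scoring keys instead of attending to the entire cache. The toy below shows only the selection pattern; it is not the DSA mechanism itself, which is engineered precisely to avoid the dense score pass this toy performs.

    # Toy top-k sparse attention (conceptual only; not the actual DSA kernel).
    import torch
    import torch.nn.functional as F

    def topk_sparse_attention(q, k, v, keep=64):
        """q: [Tq, d]; k, v: [Tk, d]. Each query attends to its `keep` strongest keys."""
        scores = (q @ k.T) / (k.shape[-1] ** 0.5)    # dense scores; real kernels skip this
        keep = min(keep, k.shape[0])
        top = scores.topk(keep, dim=-1)              # strongest keys per query
        masked = torch.full_like(scores, float("-inf"))
        masked.scatter_(-1, top.indices, top.values) # keep selected scores, mask the rest
        return F.softmax(masked, dim=-1) @ v

    q = torch.randn(8, 32)
    kv = torch.randn(1024, 32)
    print(topk_sparse_attention(q, kv, kv, keep=64).shape)  # torch.Size([8, 32])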

Key capabilities

  • Reasoning and math: Scores 92.7% on AIME 2026 I and 86.0% on GPQA-Diamond
  • Coding: Achieves 77.8% on SWE-bench Verified and 73.3% on SWE-bench Multilingual
  • Agentic tasks: 62.0 on BrowseComp and 56.2 on Terminal-Bench 2.0 (see the tool-calling sketch after this list)
  • Long context: Supports a 128K+ token context window through DSA optimization
  • Multilingual: Supports English and Chinese
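
As a concrete example of the tools capability, the sketch below issues a tool-augmented request through the ollama Python client. The get_weather schema is hypothetical, and the response fields shown follow recent versions of the client, which may differ from older ones.

    # Sketch of tool calling against glm-5:cloud via the ollama Python client.
    import ollama

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, not part of Ollama or GLM-5
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    response = ollama.chat(
        model="glm-5:cloud",
        messages=[{"role": "user", "content": "What is the weather in Beijing right now?"}],
        tools=tools,
    )

    # Any tool calls the model requested are attached to the returned message.
    for call in (response.message.tool_calls or []):
        print(call.function.name, call.function.arguments)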

Benchmarks

Footnotes

  • Humanity’s Last Exam (HLE) & other reasoning tasks: We evaluate with a maximum generation length of 131,072 tokens (temperature=1.0, top_p=0.95, max_new_tokens=131072; these sampling settings are mirrored in the Ollama options sketch after these notes). By default, we report the text-only subset; results marked with * are from the full set. We use GPT-5.2 (medium) as the judge model. For HLE-with-tools, we use a maximum context length of 202,752 tokens.
  • SWE-bench & SWE-bench Multilingual: We run the SWE-bench suite with OpenHands using a tailored instruction prompt. Settings: temperature=0.7, top_p=0.95, max_new_tokens=16384, with a 200K context window.
  • BrowseComp: Without context management, we retain details from the most recent 5 turns. With context management, we use the same discard-all strategy as DeepSeek-v3.2 and Kimi K2.5.
  • Terminal-Bench 2.0 (Terminus 2): We evaluate with the Terminus framework using timeout=2h, temperature=0.7, top_p=1.0, max_new_tokens=8192, with a 128K context window. Resource limits are capped at 16 CPUs and 32 GB RAM.
  • Terminal-Bench 2.0 (Claude Code): We evaluate in Claude Code 2.1.14 (think mode, default effort) with temperature=1.0, top_p=0.95, max_new_tokens=65536. We remove wall-clock time limits due to generation speed, while preserving per-task CPU and memory constraints. Scores are averaged over 5 runs. We fix environment issues introduced by Claude Code and also report results on a verified Terminal-Bench 2.0 dataset that resolves ambiguous instructions (see: https://huggingface.co/datasets/zai-org/terminal-bench-2-verified).
  • CyberGym: We evaluate in Claude Code 2.1.18 (think mode, no web tools) with temperature=1.0, top_p=1.0, max_new_tokens=32000 and a 250-minute timeout per task. Results are single-run Pass@1 over 1,507 tasks.
  • MCP-Atlas: All models are evaluated in think mode on the 500-task public subset with a 10-minute timeout per task. We use Gemini 3 Pro as the judge model.
  • τ²-bench: We add a small prompt adjustment in Retail and Telecom to avoid failures caused by premature user termination. For Airline, we apply the domain fixes proposed in the Claude Opus 4.5 system card.
  • Vending Bench 2: Runs are conducted independently by Andon Labs.
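
To approximate the sampling settings above when running the model yourself, they can be passed as Ollama request options, as sketched below. Option names follow Ollama's API (num_predict caps generation, num_ctx requests the context window); whether a cloud model honors every option is an assumption.

    # Sketch: mirroring the reasoning-eval sampling settings via Ollama options.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "glm-5:cloud",
            "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
            "stream": False,
            "options": {
                "temperature": 1.0,     # matches the reasoning-benchmark setting above
                "top_p": 0.95,
                "num_predict": 131072,  # generation cap used for the HLE-style evals
                "num_ctx": 131072,      # requested context window (may be capped for cloud models)
            },
        },
        timeout=3600,
    )
    print(resp.json()["message"]["content"])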

Supported languages

  • English
  • Chinese

License

MIT
