Model Release · May 3, 2026 · 8 min read

GLM-5.1 took SWE-Bench Pro at $1.40/M input. The catch is the small print, not the price.

Z.ai shipped a 754B-parameter MoE on April 7, 2026 that briefly held the SWE-Bench Pro crown at 58.4 - above GPT-5.4 and Claude Opus 4.6 - while charging less than a quarter of Opus's blended rate and under half of GPT-5.4's. MIT licensed, trained on Huawei Ascend chips, and self-hostable at FP8. Here is what the bill actually looks like, where the benchmarks need a hedge, and what the move from $5/$25 Opus territory to $1.40/$4.40 Z.ai territory really gets you.

GLM-5.1 benchmark comparison chart showing SWE-Bench Pro and agentic engineering scores against GPT-5.4 and Claude Opus 4.6

Image source: Z.ai (zai-org/GLM-5 GitHub)

  • $1.40/1M input, $4.40/1M output, $0.26 cached input on Z.ai direct - input tokens cost 72% less than Opus 4.7
  • SWE-Bench Pro 58.4 (#1 globally on April 7, briefly held by Z.ai before Kimi K2.6 ticked past at 58.6)
  • 754B total / 40B active MoE, 200K context, 128K max output, MIT license
  • Trained entirely on Huawei Ascend 910B with MindSpore - no Nvidia in the stack
  • FP8 weights at ~750GB (the BF16 checkpoint is ~1.5TB) - the FP8 repo has 795K downloads, more than the full-precision one

Two prices, one model

Z.ai publishes one rate. OpenRouter, which has been routing GLM-5.1 since launch week, publishes a lower one. Both are real. Quote whichever your billing actually sees.

| Route | Input / 1M | Cached / 1M | Output / 1M | Notes |
|---|---|---|---|---|
| Z.ai direct (official) | $1.40 | $0.26 | $4.40 | 3x peak surcharge 14:00-18:00 Beijing |
| OpenRouter | $1.05 | $0.21-0.52 | $3.50 | 65,535 max output |
| Self-host (FP8) | ~750GB weights, runs on 8x H200 or 8x B200 | - | - | vLLM, SGLang, KTransformers |

The peak surcharge is the gotcha most coverage buries. From 2pm to 6pm Beijing time the Z.ai direct rate triples - $4.20/1M input, $13.20/1M output. If your workload runs through a US business day or has burst traffic spanning that window, you are either paying the surcharge or paying OpenRouter to absorb it for you. The $1.05/$3.50 OpenRouter rate looks like a discount on paper. It is partly a hedge against time-of-day pricing.
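
If you want to see how quickly the surcharge erodes the headline rate, here is a minimal sketch of the blended Z.ai direct price as a function of how much traffic lands in the peak window. The base rates and 3x multiplier come from the table above; the traffic fractions are placeholders for whatever your own logs show.

```python
# Blended Z.ai direct rate when some fraction of traffic lands in the
# 14:00-18:00 Beijing surcharge window (3x the base rate).
BASE_INPUT = 1.40    # $ per 1M input tokens, Z.ai direct
BASE_OUTPUT = 4.40   # $ per 1M output tokens, Z.ai direct
SURCHARGE = 3.0      # peak-window multiplier

def blended_rate(base: float, peak_fraction: float) -> float:
    """Average $/1M tokens when peak_fraction of tokens hit the peak window."""
    return base * ((1 - peak_fraction) + peak_fraction * SURCHARGE)

# Hypothetical traffic splits - replace with your own distribution.
for peak in (0.0, 0.25, 0.50):
    print(f"{peak:.0%} peak traffic -> "
          f"${blended_rate(BASE_INPUT, peak):.2f} in / "
          f"${blended_rate(BASE_OUTPUT, peak):.2f} out per 1M")
```

At even 25% of tokens in the window, the blended direct rate is roughly double OpenRouter's flat $1.05/$3.50.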

What 1.2M tokens of agentic coding actually costs

A realistic shape for a coding agent task: 1M tokens of context (system prompt, tool definitions, file content, prior turns) plus 200K tokens of output (reasoning traces, edits, tool calls). Same shape, different APIs:

| Model | Input cost | Output cost | Total | vs GLM-5.1 |
|---|---|---|---|---|
| GLM-5.1 (OpenRouter) | $1.05 | $0.70 | $1.75 | 0.77x |
| GLM-5.1 (Z.ai) | $1.40 | $0.88 | $2.28 | 1.0x |
| Kimi K2.6 | $0.60 | $0.50 | $1.10 | 0.48x |
| DeepSeek V4-Pro (promo) | $0.435 | $0.174 | $0.61 | 0.27x |
| Qwen 3.6 Max | $1.30 | $1.56 | $2.86 | 1.25x |
| GPT-5.4 | $2.50 | $3.00 | $5.50 | 2.4x |
| Claude Opus 4.7 | $5.00 | $5.00 | $10.00 | 4.4x |
| GPT-5.5 | $5.00 | $6.00 | $11.00 | 4.8x |

1M input + 200K output at list prices, no batch or cache discounts. The DeepSeek V4-Pro promo runs through May 31, 2026 and reverts to $1.74/$3.48 after that. Cache and batch APIs (where supported) cut these numbers further but are workload-dependent.
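
The totals above are plain list-price arithmetic. A minimal sketch of the same calculation in code, with per-million rates quoted in this post where available and back-derived from the table otherwise (Kimi K2.6's output rate, for instance, is inferred from its $0.50-per-200K line):

```python
# Cost of one agentic task: 1M input + 200K output tokens at list prices.
# Rates are $ per 1M tokens; GLM and Opus rates are quoted in this post,
# the Kimi output rate is back-derived from the comparison table.
RATES = {
    "GLM-5.1 (OpenRouter)": (1.05, 3.50),
    "GLM-5.1 (Z.ai)":       (1.40, 4.40),
    "Kimi K2.6":            (0.60, 2.50),
    "Claude Opus 4.7":      (5.00, 25.00),
}
INPUT_TOKENS, OUTPUT_TOKENS = 1_000_000, 200_000

def task_cost(input_rate: float, output_rate: float) -> float:
    """Dollar cost of one task at the given per-million rates."""
    return (input_rate * INPUT_TOKENS + output_rate * OUTPUT_TOKENS) / 1_000_000

baseline = task_cost(*RATES["GLM-5.1 (Z.ai)"])
for model, (in_rate, out_rate) in RATES.items():
    cost = task_cost(in_rate, out_rate)
    print(f"{model:22s} ${cost:5.2f}  ({cost / baseline:.2f}x vs GLM-5.1 direct)")
```

The same function with a cache-hit fraction and the $0.26 cached-input rate is how you would model the discounts the footnote mentions.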

DeepSeek V4-Flash and V4-Pro still beat GLM-5.1 on raw price. The honest framing is not "cheapest frontier coder" - it is "cheapest coder at the same SWE-Bench Pro tier as Opus and GPT-5." That is a narrower claim that holds up better.

The benchmark numbers, with the necessary hedges

Z.ai's release blog led with the SWE-Bench Pro 58.4 number. It is real, and at the time of release it was first place globally - above GPT-5.4's 57.7 and Opus 4.6's 57.3. That ordering held for about two weeks before Moonshot shipped Kimi K2.6 at 58.6.

GLM-5.1 benchmark chart on SWE-Bench Pro, Terminal-Bench, and agentic engineering tasks

Source: Z.ai official benchmark chart (z.ai/blog/glm-5.1).

| Benchmark | GLM-5.1 | GPT-5.4 | Claude Opus 4.6 | Kimi K2.6 |
|---|---|---|---|---|
| SWE-Bench Pro | 58.4 | 57.7 | 57.3 | 58.6 |
| Terminal-Bench 2.0 | 63.5-69.0 | - | - | - |
| NL2Repo | 42.7 | 41.3 | 33.4 | - |
| MCP-Atlas (public) | 71.8 | 67.2 | 69.2 | - |
| GPQA Diamond | 86.2 | - | 91.3 | - |
| AIME 2026 | 95.3 | - | 98.2 | 96.7 |
| HLE (with tools) | 52.3 | - | 45.0 | - |
| AA Intelligence Index | 51 (#4) | 57 | 57 | 54 |

Two patterns show up in the table. On agentic engineering benchmarks - the ones built around tool use, repo navigation, and multi-step coding work - GLM-5.1 either leads or sits within rounding distance of the frontier. NL2Repo at 42.7 is a clean beat over GPT-5.4. MCP-Atlas public set at 71.8 is the highest published. HLE with-tools at 52.3 actually leads Opus 4.6.

On pure-knowledge QA without scaffolding, the gap reopens. GPQA Diamond at 86.2 is 5 points behind Opus 4.6. AIME 2026 at 95.3 is 3 points behind. The AA Intelligence Index, which weights breadth heavily, puts GLM-5.1 at #4 globally with a score of 51 - real, but not the "ties Opus" framing some launch-day coverage ran with.

What to take from this: if your workload is agentic engineering with tools, the published benchmarks support paying GLM-5.1 prices instead of Opus prices. If your workload is closer to long-form QA or research without tool use, the cheaper price costs you 5+ points on the relevant evals.

754B/40B MoE, MIT licensed, no Nvidia in the training stack

The architecture is recognisably DeepSeek-flavored. Multi-head Latent Attention for KV-cache compression, DeepSeek-style sparse attention, 256 routed experts plus one shared expert with top-8+1 active per token, 78 layers (first 3 dense, rest sparse). Where it diverges is the substrate underneath. Z.ai trained the entire model on Huawei Ascend 910B accelerators using MindSpore. No Nvidia silicon touched it.
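
To make the sparsity concrete, a quick back-of-the-envelope using only the numbers above - note the published 40B active figure covers attention, embeddings, and the dense layers as well as the routed experts:

```python
# How sparse is GLM-5.1 per token? All inputs are from the published specs.
TOTAL_PARAMS   = 754e9   # total parameters
ACTIVE_PARAMS  = 40e9    # parameters active per token (as published)
ROUTED_EXPERTS = 256
ACTIVE_ROUTED  = 8       # top-8 routed experts per token
SHARED_EXPERTS = 1       # always-on shared expert

expert_fraction = (ACTIVE_ROUTED + SHARED_EXPERTS) / (ROUTED_EXPERTS + SHARED_EXPERTS)
print(f"expert slots active per MoE layer: {expert_fraction:.1%}")               # ~3.5%
print(f"active / total parameters:         {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")  # ~5.3%

# The active fraction (~5.3%) is higher than the expert-slot fraction (~3.5%)
# because attention, embeddings, and the first 3 dense layers run for every
# token on top of the routed experts.
```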

For most readers this is sidebar trivia. For anyone whose compute strategy is shaped by export controls or supply-chain risk, it is the most interesting fact in the release. A frontier-class open-weights coder that was demonstrably trainable on non-US accelerators changes what a fall-back stack can look like, even if Nvidia remains the default.

On the inference side, Z.ai published an FP8 quantized variant alongside the full model. The FP8 repo on Hugging Face has more downloads than the BF16 one - a hint that most self-hosters are running quantized. At ~750GB the FP8 weights fit on a single 8x H200 or 8x B200 node, which is heavy but not exotic. vLLM, SGLang, xLLM, Transformers, and KTransformers all support the architecture out of the box. The HF model ID is zai-org/GLM-5.1.
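
For the self-host route, a minimal vLLM sketch, assuming the zai-org/GLM-5.1 FP8 checkpoint and one 8-GPU node; exact arguments and memory headroom depend on your vLLM version and hardware, so treat this as a starting point rather than a recipe:

```python
# Minimal offline-inference sketch for the FP8 checkpoint on one 8-GPU node.
# Assumes the zai-org/GLM-5.1 Hugging Face ID mentioned above and a vLLM
# build with the architecture support the release claims.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-5.1",
    quantization="fp8",         # FP8 weights (~750GB); may also be auto-detected
    tensor_parallel_size=8,     # shard weights across the 8x H200 / B200 node
    max_model_len=200_000,      # the full 200K context window
)

params = SamplingParams(temperature=0.2, max_tokens=4096)
outputs = llm.generate(["Write a unit test for the retry logic in client.py"], params)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also sit behind vLLM's OpenAI-compatible server or SGLang; the offline LLM class here just keeps the sketch self-contained.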

The 8-hour autonomous task pitch

Z.ai's framing for GLM-5.1 is "long-horizon agentic engineering." They demoed a single autonomous task that ran for 8 hours straight - building a complete Linux desktop environment - and a vector DB optimization run that iterated 655 times to push a benchmark from 3,500 QPS to 21,500 QPS. KernelBench Level 3 saw a 3.6x geometric-mean speedup across the harness.

Long-horizon claims like these are easy to make and hard to verify. The honest read: the model holds context and persists effort across long tool-use sequences better than most open-weights peers. Whether it does so on your codebase, with your tools, at your error tolerance is the only test that matters. The 200K context window will be the binding constraint for any real monorepo work - the frontier-tier closed models are sitting at 1M+ now, and 200K runs out fast when an agent is reading multiple files.
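
How fast 200K disappears is easy to estimate. Every number below except the window size is an illustrative assumption - substitute your own repo's averages:

```python
# Rough token budget for an agent turn inside a 200K context window.
# All figures except the window size are illustrative assumptions.
CONTEXT_WINDOW   = 200_000
SYSTEM_AND_TOOLS = 8_000    # assumed: system prompt + tool schemas
PRIOR_TURNS      = 30_000   # assumed: accumulated reasoning and tool results
AVG_FILE_TOKENS  = 4_000    # assumed: roughly a 500-line source file

remaining = CONTEXT_WINDOW - SYSTEM_AND_TOOLS - PRIOR_TURNS
print(f"files readable before trimming: {remaining // AVG_FILE_TOKENS}")  # ~40
```

Forty-odd files is plenty for a single bug fix and not much for a monorepo-wide refactor, which is the point.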

One specific caveat worth noting: several launch-day pieces cited a figure of "94.6% of Claude Opus 4.6's coding performance" that came directly from Z.ai's own evals. Independent April 2026 third-party benchmarks confirmed the SWE-Bench Pro lead but did not uniformly reproduce the parity claim. Treat self-reported composite numbers with the usual caution.

Reading the price spread on the SWE-Bench Pro tier

GLM-5.1 occupies a narrow but defensible slot. Roughly a fifth of Opus 4.7's blended cost. Roughly twice Kimi K2.6's. SWE-Bench Pro performance comparable to both. MIT licensed weights when none of those others are.

Building an agent and the decision is purely API-cost-vs-quality? Kimi K2.6 has the edge on both axes right now - marginal on SWE-Bench Pro, roughly 2x on price. If MIT licensing is a hard requirement - because legal said so, because you are reselling, because you need to self-host - GLM-5.1 is the only frontier-tier coder in that category. DeepSeek uses a custom license. Kimi's is Modified MIT. GLM is plain MIT.

For tokencost-style calculus, the relevant headline is that every model in the top SWE-Bench Pro tier - GLM-5.1, Kimi K2.6, GPT-5.5, Opus 4.7 - now sits within ~1 point of each other on that benchmark while spanning a price range from $1.10 to $11.00 on the same workload. The decision is no longer "which one is best at coding." It is "at what price-per-correct-PR does each one stop making sense for my workload."
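
A minimal sketch of that calculus, using the per-task totals from the cost table and SWE-Bench Pro scores as a crude stand-in for the fraction of attempts that produce a mergeable PR - a proxy, not a measurement of any real workload:

```python
# $ per successful task = cost per attempt / probability the attempt succeeds.
# Costs are the 1M-in / 200K-out totals from the table above; resolve rates
# use SWE-Bench Pro scores as a rough proxy for real-world success.
CANDIDATES = {
    "GLM-5.1 (Z.ai direct)": (2.28, 0.584),
    "Kimi K2.6":             (1.10, 0.586),
    "Claude Opus 4.6/4.7":   (10.00, 0.573),  # Opus 4.7 pricing, Opus 4.6 score
}

for model, (cost_per_task, resolve_rate) in CANDIDATES.items():
    print(f"{model:22s} ${cost_per_task / resolve_rate:5.2f} per resolved task")
```

The proxy is crude, but it turns a benchmark tie of "within ~1 point" into roughly a 9x spread in dollars per resolved task.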

Sources