Research · May 7, 2026 · 8 min read

GLM-4.7-flash sits at $0.07 input on AWS Bedrock and Vertex. Most coverage skipped this one.

The GLM-5.1 launch ate every Z.ai headline this past month. Underneath that release, Z.ai quietly stabilized GLM-4.7-flash across AWS Bedrock, Vertex AI Model Garden, Cerebras, Together, and OpenRouter at $0.07 per million input tokens. That makes it the only 200K-context model under a dime per million input tokens that an enterprise can procure without leaving its primary cloud. Here is what the bill looks like, where it sits relative to Haiku 4.5 and Gemini Flash-Lite, and the specific workloads where paying a fourteenth of Haiku's input rate changes the calculus.

[Image: code editor in a dark room. Photo by Harshit Katiyar on Unsplash]

  • $0.07/M input, $0.40/M output, ~$0.014/M cached input on Z.ai direct
  • ~200K input context window across providers (max output varies: Bedrock card lists 4K, OpenRouter shows higher)
  • Available on AWS Bedrock, Vertex AI Model Garden, Cerebras, Together, OpenRouter
  • Cheapest 200K-context API among models stocked on the major Western enterprise clouds
  • Sits one tier below GLM-4.7 ($0.60/$2.20) and two tiers below GLM-5 ($1.00/$3.20)
  • Naming caveat: Z.ai's direct pricing page lists this $0.07/$0.40 SKU as "GLM-4.7-FlashX"; Bedrock and most aggregators market the same model as "GLM-4.7-flash." Same weights, same rate, two names.

Why a four-month-old model is back in the conversation

GLM-4.7 and its flash variant launched in early 2026, well ahead of the GLM-5 family. In a normal release cycle that would put them on the shelf already. Z.ai took a different route: GLM-5 became the flagship, GLM-5.1 the agentic specialist, and the 4.7 lineup stuck around as the cost floor. Within a 30-day window the flash variant rolled out to AWS Bedrock, Vertex AI Model Garden, and Cerebras. The pricing did not change. The procurement story did.

For teams that cannot route traffic outside their primary cloud - banks, hospitals, government, anyone with a data residency clause - GLM-4.7-flash going to Bedrock means $0.07 input is now reachable through an existing AWS contract. That single piece of plumbing changed who can quote this number on a budget review. It is the reason this post exists in May rather than February.

Same model, five places to buy it

Published rates line up almost exactly across providers. Where they differ is on what is wrapped around the base rate - cached input pricing, batch endpoints, max output limits, and how the surcharge picture looks at peak hours.

| Provider | Input / 1M | Output / 1M | Notes |
|---|---|---|---|
| Z.ai direct | $0.07 | $0.40 | Cached input ~$0.014; SKU labelled GLM-4.7-FlashX |
| AWS Bedrock | $0.07 | $0.40 | Model ID zai.glm-4.7-flash; bills against AWS contract |
| Vertex AI MaaS | $0.07 | $0.40 | GLM-4.7 family in Vertex Model Garden, IAM-gated |
| OpenRouter | $0.06-0.07 | $0.40 | z-ai/glm-4.7-flash; lists $0.06 input on some routes |
| Cerebras | $0.07 | $0.40 | Hosts a REAP-23B compressed variant alongside the full model |

The base rate is the same number across all five surfaces, give or take a penny on OpenRouter. What differs is the wrapping. Bedrock gives you the model ID zai.glm-4.7-flash inside an AWS-billed account; Vertex puts it behind GCP IAM and partner-model routing; Cerebras runs an additional REAP-compressed 23B variant that trades capability for higher throughput; OpenRouter handles fail-over across multiple backends. None of these add per-token markup at the published list rate. That is unusual for an aggregator-routed model and is part of why the post is worth writing.
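For teams already on AWS, the call shape is the standard Bedrock Converse API. A minimal sketch, assuming the model ID quoted above; the region, prompt, and inference settings are illustrative:

```python
import boto3

# Minimal Bedrock Converse call against the GLM-4.7-flash model ID quoted
# above. Region, prompt, and inference settings are illustrative.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="zai.glm-4.7-flash",
    messages=[{
        "role": "user",
        "content": [{"text": "Label this support ticket: billing, bug, or other?"}],
    }],
    inferenceConfig={"maxTokens": 64, "temperature": 0.0},
)

print(response["output"]["message"]["content"][0]["text"])
```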

What 5M tokens of light-touch RAG actually costs

Light-touch RAG is the shape that exposes the input-heavy budget tier most clearly: large retrieved context, short answers. Pick a daily volume of 5M input tokens (roughly 2,500 queries with 2K of context each) and 200K output tokens. Same shape, seven different APIs:

| Model | Input cost | Output cost | Daily total | Multiple |
|---|---|---|---|---|
| GLM-4.7-flash | $0.35 | $0.08 | $0.43 | 1.0x |
| Gemini 2.5 Flash-Lite | $0.50 | $0.08 | $0.58 | 1.3x |
| DeepSeek V4-Flash | $0.70 | $0.06 | $0.76 | 1.7x |
| GPT-5.4 Nano | $1.00 | $0.25 | $1.25 | 2.9x |
| GLM-4.7 | $3.00 | $0.44 | $3.44 | 8.0x |
| GPT-5.4 Mini | $3.75 | $0.90 | $4.65 | 10.8x |
| Claude Haiku 4.5 | $5.00 | $1.00 | $6.00 | 14.0x |

5M input + 200K output, list prices. No batch or context-cache discounts. Cached input pricing on Z.ai (~$0.014/M) would cut the GLM-4.7-flash input line to about $0.07 on a fully repeated prompt, putting the daily total near $0.15. Cross-checked against Artificial Analysis provider data and the AWS Bedrock model card.
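The arithmetic behind the table is short enough to check. A sketch reproducing three rows, using the list rates quoted in this post (Haiku's per-million rates are back-derived from its row):

```python
# Reproduce three daily-total rows above: 5M input + 200K output per day.
RATES = {  # $ per 1M tokens: (input, output)
    "GLM-4.7-flash": (0.07, 0.40),
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
    "Claude Haiku 4.5": (1.00, 5.00),  # back-derived from its row
}
INPUT_M, OUTPUT_M = 5.0, 0.2  # millions of tokens per day

baseline = None
for model, (in_rate, out_rate) in RATES.items():
    total = INPUT_M * in_rate + OUTPUT_M * out_rate
    if baseline is None:
        baseline = total  # first row (GLM-4.7-flash) is the 1.0x reference
    print(f"{model}: ${total:.2f}/day ({total / baseline:.1f}x)")
# GLM-4.7-flash: $0.43/day (1.0x)
# Gemini 2.5 Flash-Lite: $0.58/day (1.3x)
# Claude Haiku 4.5: $6.00/day (14.0x)
```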

The table's shape tells you more than the absolute numbers. On input-bound RAG workloads, the spread between GLM-4.7-flash and Haiku 4.5 hits 14x: roughly the difference between paying for a beer and paying for a meal at the same bar. On output-bound workloads (long generation, code synthesis) the gap narrows, because output rates across the tier sit in a $0.28-$5.00 band where GLM-4.7-flash's $0.40 is ordinary rather than exceptional. Whether GLM-4.7-flash wins your bill comes down to your input ratio more than its raw input price tag.
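To put numbers on that, the same rates can be swept across workload shapes. A sketch with illustrative output shares:

```python
# Cost multiple vs. GLM-4.7-flash as a 5M-token/day workload shifts
# output-heavy. Rates as in the tables above; the shares are illustrative.
FLASH, LITE, HAIKU = (0.07, 0.40), (0.10, 0.40), (1.00, 5.00)  # $/M (in, out)

def daily_cost(rates, inp_m, out_m):
    return inp_m * rates[0] + out_m * rates[1]

for out_share in (0.04, 0.25, 0.50):  # output tokens as a share of total
    inp, out = 5.0 * (1 - out_share), 5.0 * out_share
    flash = daily_cost(FLASH, inp, out)
    lite = daily_cost(LITE, inp, out) / flash
    haiku = daily_cost(HAIKU, inp, out) / flash
    print(f"output share {out_share:.0%}: Flash-Lite {lite:.2f}x, Haiku {haiku:.1f}x")
```

Within the budget tier the multiple collapses toward 1.0x as output dominates; against Haiku it only drifts from roughly 14x to 13x, because Haiku's output rate is an order of magnitude higher too.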

What "flash" means inside the GLM stack

Z.ai borrowed the "flash" naming convention from Google, where it signals the cheap, fast tier in a family. The spread between flash and base, though, is wider for GLM than for Gemini. Google's 2.5 Flash sits at $0.30 input; Gemini 2.5 Pro sits at $1.25 - roughly a 4x gap. GLM-4.7-flash at $0.07 against GLM-4.7 at $0.60 is closer to 8.5x. Z.ai is willing to give up much more capability at the budget tier.

What you give up in exchange is straightforward: the flash variant is not the agentic-engineering specialist. It does not hold context across long tool-use chains the way GLM-5.1 was designed to. It is not the model to run an 8-hour autonomous task on. Z.ai and AWS describe it generically as "lightweight, optimized for fast inference and low-latency tasks." In practice that lines up with routing-layer work - classification, extraction, summarization, retrieval grounding, structured output - the kind of work that fills 80% of an LLM API bill at most companies but never makes it into a benchmark blog.
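As a sketch of one of those routing-layer calls through OpenRouter's OpenAI-compatible endpoint - the model slug is the one from the provider table; the prompt, fields, and JSON-mode flag are illustrative and depend on route support:

```python
from openai import OpenAI

# Structured extraction through OpenRouter's OpenAI-compatible API.
# The slug is from the provider table; everything else is illustrative.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

resp = client.chat.completions.create(
    model="z-ai/glm-4.7-flash",
    messages=[
        {"role": "system",
         "content": "Extract vendor, amount, and due_date as JSON."},
        {"role": "user",
         "content": "Invoice #4412 from Acme Corp, $1,280.00, net 30 from May 1."},
    ],
    response_format={"type": "json_object"},  # if the chosen route supports it
    max_tokens=128,
)
print(resp.choices[0].message.content)
```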

That positioning is honest, and it is the right framing to bring to a vendor comparison. Lining GLM-4.7-flash up against Opus 4.7 on SWE-Bench Pro is not a useful exercise. Lining it up against Haiku 4.5 and Flash-Lite on a high-volume extraction or routing pipeline is exactly the comparison the price was set for.

The 200K window in a tier full of 32K and 128K models

Budget-tier APIs do not all cap context tightly anymore. GPT-5.4 Nano runs at a full 1M context, DeepSeek V4-Flash holds 1M, Haiku 4.5 sits at 200K, Mistral Small 4 at 128K, Codestral-2508 at 256K. Inside that pack, GLM-4.7-flash's 200K ceiling is competitive without being unique. The window is plenty for RAG and classification but not the longest in the tier.

What is unique is the price-to-window ratio. Among models priced at or under $0.10 per million input tokens with at least a 128K context window available on a major Western cloud, the list is short: Gemini 2.5 Flash-Lite at $0.10 input, GLM-4.7-flash at $0.07. Two models. That is the bracket where this post earns its keep, not in the broader budget-tier roundup.

When GLM-4.7-flash beats the alternatives, and when it does not

Use it for: high-volume document classification, retrieval-grounded Q&A with long context windows, structured output extraction, log triage, intent routing, customer-support draft suggestions, batched summarization. Anywhere your workload shape is many short calls with significant input context and short outputs, the input-side discount compounds quickly.

Skip it for: agentic coding with tools, long autonomous runs, multi-turn reasoning chains, anything where output quality is doing most of the work. GLM-5.1 exists for the agentic side of the Z.ai stack, and Sonnet/Opus exist for the closed-source side. Reaching for GLM-4.7-flash there is a false economy. The output cost stays cheap but you pay it more often when retries cascade.

The practical advice is the obvious one: split the bill. Route the routing-layer work through GLM-4.7-flash, and route the agent work through whatever flagship clears your quality bar (which on cost-per-correct-PR these days is usually GLM-5.1 or Kimi K2.6 in the open-weights camp, Sonnet 4.6 in the closed). Most teams running both layers see the GLM-4.7-flash slice cover 60-80% of total tokens at under 5% of total spend.
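A minimal sketch of that split - the task labels and the flagship ID are placeholders for whatever your stack actually distinguishes:

```python
# Two-tier routing: flash tier for routing-layer work, flagship for agents.
# Task labels and model IDs are placeholders, not a recommended taxonomy.
FLASH_TIER = {"classify", "extract", "summarize", "route", "draft_reply"}

def pick_model(task_type: str) -> str:
    if task_type in FLASH_TIER:
        return "zai.glm-4.7-flash"  # Bedrock model ID from the table above
    return "glm-5.1"                # placeholder flagship

assert pick_model("extract") == "zai.glm-4.7-flash"
assert pick_model("agentic_coding") == "glm-5.1"
```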

Sources