Skip to main content
TokenCost logoTokenCost
Model ReleaseJune 5, 2026·8 min read

Qwen3.7 Plus costs a sixth of Max on input, and it can do the one thing Max can't: read images

Alibaba shipped Qwen3.7 Plus on June 1 at $0.40 input and $1.60 output per million tokens. That is a sixth of what Qwen3.7 Max charges on input and under a quarter on output, and unlike Max it takes image and video input. BABA stock jumped more than 6% the next morning. We dug into the rate card, the real cost gap against Max and Gemini 3.5 Flash, the benchmark scores worth trusting, and the discrepancy in the output price that two sources can't agree on.

Abstract purple and red light streaks on a dark background representing multimodal token throughput

Photo by Liana S on Unsplash

The price is the headline, but vision is the actual story

Two weeks ago we wrote up Qwen3.7 Max and flagged a limit buried under the launch noise: it is text-only. No image input, despite coverage that implied otherwise. If your agent reads a screenshot or a design mock, Max can't help. Qwen3.7 Plus is Alibaba's answer to that gap. Same 1M context, same agentic toolkit, plus vision and video understanding, at a fraction of the price. The cheapest way to describe it is "Max with eyes, minus most of the bill."

The rate card: $0.40 input, $1.60 output, $0.08 cached input per million tokens. One wrinkle to know before you budget. Artificial Analysis logs the input and cache numbers exactly but lists output at $1.16, not the $1.60 every announcement quoted. We could not reconcile the two against Alibaba's own pricing doc, which had not added the 3.7 tier yet when we checked. Plan on $1.60 so a correction doesn't blow your forecast, and treat $1.16 as a possible pleasant surprise.

ModelInput / 1MOutput / 1MVisionContext
Qwen3.7 Plus$0.40$1.60Yes1M
Qwen3.7 Max$2.50$7.50No1M
Gemini 3.5 Flash$1.50$9.00Yes1M
Claude Opus 4.8$5.00$25.00Yes1M
GPT-5.5$5.00$30.00Yes1M

The honest peer here is Gemini 3.5 Flash. Both are multimodal, mid-tier, 1M-context models built for volume, and Plus undercuts it on both sides of the meter. The frontier rows are there for scale: Plus is one-twelfth of GPT-5.5 on input. Nobody cross-shops Plus against Opus on a hard reasoning task, but plenty of teams run cheap multimodal models for the boring 80% of their traffic and route the hard 20% up to a frontier model. That split is exactly where Plus wants to live.

Where it lands on benchmarks

Artificial Analysis puts the Qwen3.7 Plus Intelligence Index at 53. That sits four points under Qwen3.7 Max (57) and a notch below Gemini 3.5 Flash (55.3), which is about what you would expect from a model priced this far down. The number I would not lean on too hard: Plus is verbose, burning 110M tokens on the AA Index run against a roughly 29M average, so a slice of any cost saving gets eaten back by reasoning tokens on hard prompts.

On vision and agentic grounding, Alibaba's own chart pits Plus against Opus 4.6, GPT-5.4, and Gemini 3.1 Pro and reports a ScreenSpot Pro score of 79.0, ahead of both GPT-5.4 and Opus 4.6 on GUI grounding. Take the agentic figures below as vendor-reported until a third party reruns them. The screen-grounding result is the one that matters for the use case Plus is built for: agents that click around a UI from screenshots.

BenchmarkQwen3.7 PlusQwen3.7 Max
AA Intelligence Index5357
ScreenSpot Pro (vision)79.0n/a (text only)
Terminal-Bench (reported)70.369.7
MCP-Atlas (reported)76.476.4

Read the bottom two rows with a raised eyebrow. They come from a single secondary write-up, not Alibaba's primary chart, and they imply Plus matches or edges Max on agentic tooling, which would be odd for the cheaper model. More likely the runs used different harness versions. The safe read: Plus is a touch behind Max on reasoning, roughly level on agentic tooling, and the only one of the two that sees.

What real workloads cost

All list rates, no promo math. The vision row is where Max drops out entirely, since it can't take an image at any price. A 1080p screenshot runs about 1,300 input tokens, so a visual agent step stays cheap on Plus.

WorkloadQwen3.7 PlusQwen3.7 MaxGemini 3.5 FlashGPT-5.5
Screenshot agent step (30K in / 3K out)$0.017n/a$0.072$0.240
Coding agent step (50K in / 10K out)$0.036$0.200$0.165$0.550
Big doc parse (150K in / 8K out)$0.073$0.435$0.297$0.990
1B tokens a month (70/30 blend)$760$4,000$3,750$12,500

At the billion-token tier Plus runs about a fifth of the Max bill and a fifth of Gemini 3.5 Flash. The catch I keep flagging on Qwen models applies here too: Plus is chatty, so a reasoning-heavy mix pushes more tokens through the output meter than the table assumes. If your traffic is mostly extraction, classification, and screenshot reading rather than long chains of thought, the real bill tracks these numbers closely. If it is heavy reasoning, pad the output column.

How to actually run it

Plus is proprietary and API-only through Alibaba Cloud Model Studio, formerly Bailian. No open weights, no Hugging Face card, no self-host. You get OpenAI-compatible endpoints in three regions: Singapore on dashscope-intl, the US on dashscope-us, and Beijing on the mainland endpoint. Images go in as URLs or base64, video as frames or clips, and both bill as input tokens at $0.40 per million.

One number is missing from every source we checked: the max output ceiling for Plus is unpublished. Max caps at 65,536 tokens, so if you are building a long-form report agent on Plus, test the output limit before you ship rather than assuming it matches the flagship.

Who should switch

If you run a vision agent on a frontier model today, Plus is the obvious test. A screenshot-reading or document-parsing workload that costs a quarter on GPT-5.5 or Opus drops to under two cents on Plus, and the ScreenSpot Pro result says the grounding holds up. The same logic covers high-volume multimodal pipelines: OCR, receipt parsing, UI testing, video tagging. This is the cheapest credible model in that slot right now.

Skip it where the answer has to be right the first time. The four-point Intelligence Index gap to Max and the bigger gap to GPT-5.5 are real, and the verbosity means hard reasoning prompts cost more than the sticker implies. For pure text reasoning at volume, Qwen3.7 Max or one of the cheaper text specialists may serve better. And if you need open weights for self-hosting or an air-gapped deployment, Plus is closed, so it is off the table from the start.

Sources