Qwen3.5 Small: the 9B model that beats gpt-oss-120B on four benchmarks
Alibaba released four small models on March 2. The 9B scores 82.5 on MMLU-Pro, ahead of gpt-oss-120B at 80.8 and Qwen3-30B at 80.9. It costs $0.05 per million input tokens on OpenRouter and runs locally on a single 20GB GPU.

Image source: Qwen Blog
The headline number comes with 13 times fewer parameters than gpt-oss-120B, all four sizes are Apache 2.0 licensed, and the 9B extends to 1M context. Two genuine catches: the hallucination rate runs 80-82% on factual benchmarks, and the model generates 2-4x more output tokens than peers on equivalent tasks, so real output costs run higher than the nominal $0.15/M implies.
Four models, one architecture
The Qwen3.5 Small release on March 2 includes four parameter sizes: 0.8B, 2B, 4B, and 9B. All four share a gated DeltaNet hybrid architecture with a 3:1 ratio of linear to full attention layers, plus multi-token prediction. All are Apache 2.0 licensed, multilingual across 201 languages, and natively multimodal, handling text, images, and video in a single model.
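As a rough illustration of that 3:1 interleave, here is a minimal sketch. The exact placement of the full-attention layer within each group of four is an assumption for illustration, not something the release specifies:

```python
def layer_pattern(n_layers: int) -> list[str]:
    """Sketch of a 3:1 linear-to-full attention interleave: three
    gated-DeltaNet (linear) layers, then one full-attention layer.
    The within-group ordering is assumed, not confirmed."""
    return ["full" if (i + 1) % 4 == 0 else "linear" for i in range(n_layers)]

pattern = layer_pattern(8)
# Three linear-attention layers for every full-attention layer.
assert pattern.count("linear") == 3 * pattern.count("full")
```

Whatever the real ordering, the ratio is what matters for memory: linear-attention layers avoid the quadratic KV-cache growth that full attention pays at long context.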
Context window is 262K tokens for all four sizes. The 9B can extend to 1M tokens. VRAM requirements run roughly 2GB for the 0.8B, 5GB for the 2B, 10GB for the 4B, and 20GB for the 9B. All four are on Hugging Face and ModelScope.
Inference providers including OpenRouter and Venice had the 9B available within days of release. Alibaba Cloud DashScope also offers API access, though pricing there varied by tier in early March.
Pricing vs. the competition
Confirmed market rates as of March 2026. Prices for the smaller sizes (0.8B through 4B) vary by provider and run proportionally lower than the 9B's.
| Model | Params | Input / 1M | Output / 1M | Context |
|---|---|---|---|---|
| Qwen3.5-9B | 9B | $0.05 | $0.15 | 1M (ext.) |
| gpt-oss-120B | 120B | $0.039 | $0.19 | 131K |
| gpt-oss-20B | 20B | $0.075 | $0.30 | 131K |
| Mistral Small 4 | ~6B active | $0.15 | $0.60 | 262K |
| GPT-5.4 Nano | - | $0.20 | $1.25 | 272K |
Sources: OpenRouter (Qwen3.5-9B), OpenRouter (gpt-oss-120B). See all model pricing on TokenCost.
The 9B vs. 120B benchmark story
On MMLU-Pro, Qwen3.5-9B scores 82.5, ahead of gpt-oss-120B at 80.8 and Qwen3-30B at 80.9. The 120B model is 13 times larger by parameter count; the 30B is 3 times larger. Both predate this release and had months of community scrutiny before this comparison.
Artificial Analysis rates Qwen3.5-9B at 32 on their Intelligence Index, roughly twice the nearest sub-10B competitor at 16. On the multimodal benchmark MMMU-Pro, the 9B scores 69.2%, ahead of the previous Qwen3 VL 8B at 56.6%.
| Benchmark | Qwen3.5-9B | gpt-oss-120B | Qwen3-30B |
|---|---|---|---|
| MMLU-Pro | 82.5 | 80.8 | 80.9 |
| MMMU-Pro (vision) | 69.2% | N/A | N/A |
| HMMT Feb (math) | 83.2 | - | - |
| BFCL-V4 (function calling) | 66.1 | - | - |
| TAU2-Bench (tool use) | 79.1 | - | - |
| AA Intelligence Index | 32 | - | - |

The MMLU-Pro comparison is independently verified by Artificial Analysis, not just Alibaba's own benchmarks. The caveats in the next section still apply, but the core result holds up under third-party evaluation.
What it costs in practice
Four monthly cost scenarios against common alternatives, computed from the nominal rates in the pricing table above. These use identical token counts across all models, so read the verbose output caveat below before relying on these for Qwen specifically.

| Scenario | Tokens/month | Qwen3.5-9B | gpt-oss-120B | Mistral Small 4 | GPT-5.4 Nano |
|---|---|---|---|---|---|
| Batch summarization | 10M in / 2M out | $0.80 | $0.77 | $2.70 | $4.50 |
| RAG pipeline | 100M in / 20M out | $8.00 | $7.70 | $27.00 | $45.00 |
| Async extraction | 500M in / 50M out | $32.50 | $29.00 | $105.00 | $162.50 |
| High-volume chat | 1B in / 200M out | $80.00 | $77.00 | $270.00 | $450.00 |
Use the TokenCost calculator for your actual token counts.
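The scenario arithmetic is simple enough to script against the nominal rates from the pricing table. A minimal sketch, using two of the models listed above (real Qwen output spend runs higher per the verbosity caveat):

```python
# Nominal rates from the pricing table: $ per 1M tokens (input, output).
RATES = {
    "Qwen3.5-9B":   (0.05, 0.15),
    "gpt-oss-120B": (0.039, 0.19),
}

def monthly_cost(model: str, input_m: float, output_m: float) -> float:
    """Monthly cost in dollars for token volumes given in millions."""
    inp, out = RATES[model]
    return input_m * inp + output_m * out

# RAG pipeline scenario: 100M input, 20M output per month.
print(round(monthly_cost("Qwen3.5-9B", 100, 20), 2))    # 8.0
print(round(monthly_cost("gpt-oss-120B", 100, 20), 2))  # 7.7
```

At nominal rates the two are close; the gap only opens up once verbosity is factored in.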
Two things to check before switching
The hallucination rate on Artificial Analysis' Omniscience benchmark is 80-82% for the 4B and 9B models. That is roughly four out of five factual questions getting a wrong or made-up answer. For extraction, classification, or tasks where you verify outputs separately, this might not matter much. For knowledge retrieval or fact-heavy Q&A without a retrieval layer, it matters a lot.
Verbosity is the other thing to watch. In benchmark settings, Qwen3.5-9B generated 230-390 million output tokens for tasks where peers generated 86-109 million. That's 2-4x more output tokens for equivalent work. The nominal output rate of $0.15/M looks cheap compared to Mistral Small 4 at $0.60/M, but if the model uses 3x as many tokens to do the same job, the effective rate is closer to $0.45/M. The cost scenarios above don't account for this.
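That effective-rate adjustment is worth wiring into any price comparison. A sketch of the arithmetic:

```python
def effective_output_rate(nominal_per_m: float, verbosity_multiplier: float) -> float:
    """Nominal $/1M output tokens scaled by how many more tokens the
    model emits for the same task. A 3x-verbose model at $0.15/M costs
    what a 1x model at $0.45/M would per completed task."""
    return nominal_per_m * verbosity_multiplier

# Qwen3.5-9B nominal $0.15/M across the observed 2-4x verbosity range:
for mult in (2, 3, 4):
    print(f"{mult}x -> ${effective_output_rate(0.15, mult):.2f}/M")
```

At the top of the observed range, the effective rate matches Mistral Small 4's nominal $0.60/M exactly.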
When it makes sense
Long-context document work is where this shines most. At $0.05/M input with a 1M context extension, processing large documents is meaningfully cheaper than alternatives. Multilingual deployments benefit from the 201-language coverage and Apache 2.0 license. On-device or edge scenarios are also strong fits given the four size options - the 0.8B runs in 2GB VRAM, which opens up a lot of deployment targets. Native multimodal support helps for vision tasks without paying a separate model premium.
Skip it for fact-intensive Q&A without retrieval augmentation. The 80-82% hallucination rate on factual benchmarks is too high to rely on for knowledge lookup tasks. Same goes for workloads where output verbosity costs money - anything that needs short, precise answers will likely end up paying more per completed task than the headline rate suggests.
The short version
Qwen3.5-9B is cheap, capable on standard benchmarks, multimodal, and runs locally. The MMLU-Pro result against gpt-oss-120B is genuinely notable - a 9B beating a 120B is not the expected outcome, and the architecture changes in this generation are what made it possible.
The hallucination rate and verbose output are real limitations. We would test it against your actual workload before committing - benchmark results vary significantly by task type, and the verbosity issue hits hard on anything where output length matters.
For teams self-hosting and needing a capable open-weight model at the small end, this is the strongest option we have seen in the sub-10B category so far in 2026.