Qwen3.5 Small: the 9B model that beats gpt-oss-120B on four benchmarks
Alibaba released four small models on March 2. The 9B scores 82.5 on MMLU-Pro, ahead of gpt-oss-120B at 80.8 and Qwen3-30B at 80.9. It costs $0.05 per million input tokens on OpenRouter and runs locally on a single 20GB GPU.

Image source: Qwen Blog
The headline number comes with 13 times fewer parameters than gpt-oss-120B, all four sizes are Apache 2.0 licensed, and the 9B extends to 1M context. Two genuine catches: the hallucination rate runs 80-82% on factual benchmarks, and the model generates 2-4x more output tokens than peers on equivalent tasks, so real output costs run higher than the nominal $0.15/M implies.
Four models, one architecture
The Qwen3.5 Small release on March 2 includes four parameter sizes: 0.8B, 2B, 4B, and 9B. All four share a gated DeltaNet hybrid architecture with a 3:1 ratio of linear to full attention layers, plus multi-token prediction. All are Apache 2.0 licensed, multilingual across 201 languages, and natively multimodal, handling text, images, and video in a single model.
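As a rough illustration of that 3:1 interleave, here is a minimal sketch. The exact placement of the full-attention layer within each group of four is an assumption for illustration, not something the release specifies:

```python
def layer_pattern(n_layers: int) -> list[str]:
    """Sketch of a 3:1 linear-to-full attention interleave: three
    gated-DeltaNet (linear) layers, then one full-attention layer.
    The within-group ordering is assumed, not confirmed."""
    return ["full" if (i + 1) % 4 == 0 else "linear" for i in range(n_layers)]

pattern = layer_pattern(8)
# Three linear-attention layers for every full-attention layer.
assert pattern.count("linear") == 3 * pattern.count("full")
```

Whatever the real ordering, the ratio is what matters for memory: linear-attention layers avoid the quadratic KV-cache growth that full attention pays at long context.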
Context window is 262K tokens for all four sizes. The 9B can extend to 1M tokens. VRAM requirements run roughly 2GB for the 0.8B, 5GB for the 2B, 10GB for the 4B, and 20GB for the 9B. All four are on Hugging Face and ModelScope.
Inference providers including OpenRouter and Venice had the 9B available within days of release. Alibaba Cloud DashScope also offers API access, though pricing there varied by tier in early March.
Pricing vs. the competition
Confirmed market rates as of March 2026. Prices for the smaller sizes (0.8B through 4B) vary by provider and run proportionally lower than the 9B's.
| Model | Params | Input / 1M | Output / 1M | Context |
|---|---|---|---|---|
| Qwen3.5-9B | 9B | $0.05 | $0.15 | 1M (ext.) |
| gpt-oss-120B | 120B | $0.039 | $0.19 | 131K |
| gpt-oss-20B | 20B | $0.075 | $0.30 | 131K |
| Mistral Small 4 | ~6B active | $0.15 | $0.60 | 262K |
| GPT-5.4 Nano | - | $0.20 | $1.25 | 272K |
Sources: OpenRouter (Qwen3.5-9B), OpenRouter (gpt-oss-120B). See all model pricing on TokenCost.
The 9B vs. 120B benchmark story
On MMLU-Pro, Qwen3.5-9B scores 82.5, ahead of gpt-oss-120B at 80.8 and Qwen3-30B at 80.9. The 120B model is 13 times larger by parameter count; the 30B is 3 times larger. Both predate this release and had months of community scrutiny before this comparison.
Artificial Analysis rates Qwen3.5-9B at 32 on their Intelligence Index, roughly twice the nearest sub-10B competitor at 16. On the multimodal benchmark MMMU-Pro, the 9B scores 69.2%, ahead of the previous Qwen3 VL 8B at 56.6%.
| Benchmark | Qwen3.5-9B | gpt-oss-120B | Qwen3-30B |
|---|---|---|---|
| MMLU-Pro | 82.5 | 80.8 | 80.9 |
| MMMU-Pro (vision) | 69.2% | N/A | N/A |
| HMMT Feb (math) | 83.2 | - | - |
| BFCL-V4 (function calling) | 66.1 | - | - |
| TAU2-Bench (tool use) | 79.1 | - | - |
| AA Intelligence Index | 32 | - | - |

The MMLU-Pro comparison is independently verified by Artificial Analysis, not just Alibaba's own benchmarks. The caveats in the next section still apply, but the core result holds up under third-party evaluation.
What it costs in practice
Four monthly cost scenarios against common alternatives, computed from the nominal rates in the pricing table above. These use identical token counts across all models, so read the verbose output caveat below before relying on these for Qwen specifically.

| Scenario | Tokens/month | Qwen3.5-9B | gpt-oss-120B | Mistral Small 4 | GPT-5.4 Nano |
|---|---|---|---|---|---|
| Batch summarization | 10M in / 2M out | $0.80 | $0.77 | $2.70 | $4.50 |
| RAG pipeline | 100M in / 20M out | $8.00 | $7.70 | $27.00 | $45.00 |
| Async extraction | 500M in / 50M out | $32.50 | $29.00 | $105.00 | $162.50 |
| High-volume chat | 1B in / 200M out | $80.00 | $77.00 | $270.00 | $450.00 |
Use the TokenCost calculator for your actual token counts.
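The scenario arithmetic is simple enough to script against the nominal rates from the pricing table. A minimal sketch, using two of the models listed above (real Qwen output spend runs higher per the verbosity caveat):

```python
# Nominal rates from the pricing table: $ per 1M tokens (input, output).
RATES = {
    "Qwen3.5-9B":   (0.05, 0.15),
    "gpt-oss-120B": (0.039, 0.19),
}

def monthly_cost(model: str, input_m: float, output_m: float) -> float:
    """Monthly cost in dollars for token volumes given in millions."""
    inp, out = RATES[model]
    return input_m * inp + output_m * out

# RAG pipeline scenario: 100M input, 20M output per month.
print(round(monthly_cost("Qwen3.5-9B", 100, 20), 2))    # 8.0
print(round(monthly_cost("gpt-oss-120B", 100, 20), 2))  # 7.7
```

At nominal rates the two are close; the gap only opens up once verbosity is factored in.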
Two things to check before switching
The hallucination rate on Artificial Analysis' Omniscience benchmark is 80-82% for the 4B and 9B models. That is roughly four out of five factual questions getting a wrong or made-up answer. For extraction, classification, or tasks where you verify outputs separately, this might not matter much. For knowledge retrieval or fact-heavy Q&A without a retrieval layer, it matters a lot.
Verbosity is the other thing to watch. In benchmark settings, Qwen3.5-9B generated 230-390 million output tokens for tasks where peers generated 86-109 million. That's 2-4x more output tokens for equivalent work. The nominal output rate of $0.15/M looks cheap compared to Mistral Small 4 at $0.60/M, but if the model uses 3x as many tokens to do the same job, the effective rate is closer to $0.45/M. The cost scenarios above don't account for this.
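That effective-rate adjustment is worth wiring into any price comparison. A sketch of the arithmetic:

```python
def effective_output_rate(nominal_per_m: float, verbosity_multiplier: float) -> float:
    """Nominal $/1M output tokens scaled by how many more tokens the
    model emits for the same task. A 3x-verbose model at $0.15/M costs
    what a 1x model at $0.45/M would per completed task."""
    return nominal_per_m * verbosity_multiplier

# Qwen3.5-9B nominal $0.15/M across the observed 2-4x verbosity range:
for mult in (2, 3, 4):
    print(f"{mult}x -> ${effective_output_rate(0.15, mult):.2f}/M")
```

At the top of the observed range, the effective rate matches Mistral Small 4's nominal $0.60/M exactly.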
When it makes sense
Long-context document work is where this shines most. At $0.05/M input with a 1M context extension, processing large documents is meaningfully cheaper than alternatives. Multilingual deployments benefit from the 201-language coverage and Apache 2.0 license. On-device or edge scenarios are also strong fits given the four size options - the 0.8B runs in 2GB VRAM, which opens up a lot of deployment targets. Native multimodal support helps for vision tasks without paying a separate model premium.
Skip it for fact-intensive Q&A without retrieval augmentation. The 80-82% hallucination rate on factual benchmarks is too high to rely on for knowledge lookup tasks. Same goes for workloads where output verbosity costs money - anything that needs short, precise answers will likely end up paying more per completed task than the headline rate suggests.
The short version
Qwen3.5-9B is cheap, capable on standard benchmarks, multimodal, and runs locally. The MMLU-Pro result against gpt-oss-120B is genuinely notable - a 9B beating a 120B is not the expected outcome, and the architecture changes in this generation are what made it possible.
The hallucination rate and verbose output are real limitations. We would test it against your actual workload before committing - benchmark results vary significantly by task type, and the verbosity issue hits hard on anything where output length matters.
For teams self-hosting and needing a capable open-weight model at the small end, this is the strongest option we have seen in the sub-10B category so far in 2026.