Research · April 30, 2026 · 8 min read

Qwen3-Next-Thinking: the cheapest reasoning model under $1/M output

Most pricing posts in 2026 chase the GPT-5.5 vs Opus 4.7 frontier. Look one tier down and there is a stranger fact: a model that costs $0.0975 in and $0.78 out can score 87.8 on AIME25 and 77.2 on GPQA. It is open weights, it is permissively licensed, and almost nobody is writing about it.

What it costs versus its peers

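| Model | Input / 1M | Output / 1M | Native trace | License |
| --- | --- | --- | --- | --- |
| Qwen3-Next-80B-A3B-Thinking | $0.0975 | $0.78 | Yes | Apache 2.0 |
| DeepSeek V4 Flash | $0.14 | $0.28 | No | Open |
| Qwen3-VL-235B Instruct | $0.20 | $0.88 | No | Open |
| Gemini 3 Flash | $0.50 | $3.00 | Optional | Closed |
| GPT-5.4 Mini | $0.75 | $4.50 | Optional | Closed |
| Claude Haiku 4.5 | $1.00 | $5.00 | Optional | Closed |
| Qwen3-Max | $0.78 | $3.90 | No | Closed |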

Pricing from OpenRouter cheap-route listings (April 30, 2026), DeepSeek API docs, and official Anthropic and Google rates. "Native trace" means the model emits a chain-of-thought by default without prompting tricks.

Why this is the model nobody is writing about

Qwen3-Next-80B-A3B-Thinking shipped from Alibaba in September 2025. Eight months later the AI press cycle has moved through GPT-5.4, GPT-5.5, Claude Opus 4.7, DeepSeek V4, and three Gemini revisions. None of those pairs open weights with a native reasoning trace below $1 per million output tokens.

The benchmark scores got buried because Qwen released them alongside Qwen3-Max and Qwen3-VL the same week. Coverage focused on the flagship 235B variant. The 80B Thinking SKU is the better deal.

We pulled the numbers because a reader asked us why our pricing table did not include it. Once we ran the cost-per-task math, the answer was simple: nothing else is close on $/correct-AIME-answer. Below $1 output and above 85 on AIME25 is a combination only this model offers right now.

How 80B parameters get priced like 7B

The architecture matters here, not as trivia, but because it explains the price tag. Qwen3-Next is a high-sparsity Mixture of Experts. Total parameters: 80 billion. Active parameters per token: roughly 3 billion. The router picks 3B worth of experts for each token; the rest sit idle.
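To make the routing idea concrete, here is a minimal sketch of generic top-k expert selection. It is not Qwen's actual router: the expert count, the value of `k`, the logits, and the helper name `route_top_k` are all toy assumptions chosen for illustration.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    # Indices of the k largest logits (ties broken by lower index).
    top = sorted(range(len(router_logits)), key=lambda i: (-router_logits[i], i))[:k]
    # Softmax over the selected logits only, shifted for numerical stability.
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: exps[i] / total for i in top}

# Toy numbers: 8 experts, 2 active per token (illustrative, not Qwen's real configuration).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
weights = route_top_k(logits, k=2)
print(sorted(weights))  # experts 1 and 4 are selected; the other six stay idle
```

Only the selected experts run for that token, which is why per-token compute tracks the active parameter count rather than the total.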

Serving cost on a GPU scales with active parameters and KV cache, not total parameter count. A 3B-active model fits on a single H100 with room for batching. Throughput approaches what you would see from a Llama 3.1 8B deployment. The provider that fronts it on OpenRouter passes that math through to your bill.
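A back-of-the-envelope sketch of that comparison, assuming the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass. The rule ignores attention and KV-cache costs, and the dense 8B comparison point is an assumption standing in for the Llama 3.1 8B reference.

```python
TOTAL_PARAMS = 80e9   # from the article
ACTIVE_PARAMS = 3e9   # "roughly 3 billion" per the article
DENSE_8B = 8e9        # assumed comparison point: a dense 8B model

def forward_flops_per_token(active_params):
    """Rule-of-thumb estimate: ~2 FLOPs per active parameter per token (ignores attention)."""
    return 2 * active_params

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
ratio_vs_dense_8b = forward_flops_per_token(ACTIVE_PARAMS) / forward_flops_per_token(DENSE_8B)

print(f"active fraction: {active_fraction:.2%}")
print(f"compute vs dense 8B: {ratio_vs_dense_8b:.3f}x")
```

The estimate covers compute only; it says nothing about the memory needed to hold all 80B weights.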

The catch is quality. MoE routing is hit or miss; some queries activate the wrong experts and the answer reflects that. Qwen tunes this with a hybrid Gated DeltaNet plus Gated Attention setup that NVIDIA wrote up in detail in their blog about the architecture. The published benchmarks suggest the routing works well enough on math, coding, and tool use to compete with dense 30-32B models in the same family.

Net effect: the sparsity dividend lands on your invoice. You pay 7B-tier serving costs for 80B-tier output, on workloads where the router cooperates. That is a real arbitrage if you set up the right evals before committing.

Cost per AIME-style problem, with realistic trace lengths

Reasoning model pricing is misleading until you account for the trace. Thinking-style models emit long chains of internal reasoning before producing the final answer. Qwen warns explicitly on the model card that the trace is longer than its 30B predecessor. Real-world output token counts for a single AIME problem land between 4,000 and 10,000 tokens depending on difficulty.

Here is what a single hard problem actually costs. Assume 1,500 input tokens (problem statement plus a short system prompt) and 8,000 output tokens (typical for an AIME walk-through with answer):

| Model | Cost / problem | AIME25 score | Notes |
| --- | --- | --- | --- |
| Qwen3-Next-Thinking | $0.0064 | 87.8 | $0.0975 in / $0.78 out, 8K trace |
| DeepSeek V4 Flash | $0.0024 | Not native reasoner | Cheap but no thinking trace |
| Gemini 3 Flash (with thinking) | $0.0247 | 82.4 | $0.50 in / $3.00 out |
| Claude Haiku 4.5 (extended thinking) | $0.0415 | 78.1 | $1.00 in / $5.00 out |
| GPT-5.4 Mini (medium reasoning) | $0.0371 | 84.6 | $0.75 in / $4.50 out |
| GPT-5.5 (standard) | $0.2475 | 94.3 | $5.00 in / $30.00 out |

Per-problem cost assumes 1,500 input tokens and 8,000 output tokens. AIME25 scores from each provider's published model card. Smaller reasoning models often hit shorter traces, so real bills can come in under these estimates.
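The per-problem figures can be reproduced with a few lines of arithmetic. This is a minimal sketch using the prices listed in the article; the function name `cost_per_problem` and the dictionary layout are illustrative choices, and the last printed digit may differ from the table by rounding.

```python
# Prices in USD per 1M tokens, copied from the article's tables.
PRICES = {
    "Qwen3-Next-Thinking": (0.0975, 0.78),
    "DeepSeek V4 Flash": (0.14, 0.28),
    "Gemini 3 Flash": (0.50, 3.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "GPT-5.4 Mini": (0.75, 4.50),
    "GPT-5.5": (5.00, 30.00),
}

def cost_per_problem(input_price, output_price, input_tokens=1_500, output_tokens=8_000):
    """Dollar cost of one call given per-million-token prices and token counts."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

for model, (price_in, price_out) in PRICES.items():
    print(f"{model}: ${cost_per_problem(price_in, price_out):.4f}")
```

Passing `output_tokens=4_000` or `output_tokens=10_000` shows how sensitive the bill is to trace length: for Qwen3-Next-Thinking the range works out to roughly $0.0033 to $0.0079 per problem.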

Read the table this way: at 87.8 on AIME25, Qwen3-Next-Thinking lands 6.5 points behind GPT-5.5 while costing roughly 39x less per problem. Against Haiku 4.5 it is about 6.5x cheaper and scores 9.7 points higher. Gemini 3 Flash with thinking enabled is the closest peer on price and underperforms by 5.4 points.
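The same comparison can be expressed as cost ratios and a cost per correct answer. The article does not define the $/correct-answer metric, so this sketch assumes the simplest reading: cost per problem divided by the AIME25 score treated as a per-problem success rate. The name `cost_per_correct` is illustrative.

```python
# (cost per problem in USD, AIME25 score) from the article's table.
TABLE = {
    "Qwen3-Next-Thinking": (0.0064, 87.8),
    "Gemini 3 Flash (with thinking)": (0.0247, 82.4),
    "Claude Haiku 4.5 (extended thinking)": (0.0415, 78.1),
    "GPT-5.4 Mini (medium reasoning)": (0.0371, 84.6),
    "GPT-5.5 (standard)": (0.2475, 94.3),
}

def cost_per_correct(cost, score_pct):
    """Expected spend per correct answer, reading the score as a success rate (assumption)."""
    return cost / (score_pct / 100)

base_cost, base_score = TABLE["Qwen3-Next-Thinking"]
for model, (cost, score) in TABLE.items():
    print(
        f"{model}: {cost / base_cost:.1f}x cost, "
        f"{score - base_score:+.1f} pts, "
        f"${cost_per_correct(cost, score):.4f} per correct answer"
    )
```

Under this reading, a model has to be both cheap and accurate to rank well; a higher score only pays off if the price gap is smaller than the accuracy gap.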

DeepSeek V4 Flash is included to make the point honestly. It is cheaper per token but it is not a thinking model. If your task does not need a trace, V4 Flash beats this on cost. If your task benefits from one, V4 Flash is the wrong tool.

Where this model fits and where it does not

The right framing is not "Qwen3-Next-Thinking versus everything." It is which workloads cost less here than anywhere else, and which ones you should run elsewhere.

Pick this model for math evals that benefit from native CoT, agent loops where you actually want to see the reasoning trace for debugging, code review or planning tasks where the cost per call sits above $0.05 on closed-source reasoners, and any open-weight requirement (data residency, self-hosting, fine-tuning). The 262K native context covers most long-document reasoning use cases without YaRN tricks.

Skip it for high-volume non-reasoning bulk work (DeepSeek V4 Flash is cheaper and does not waste tokens on traces you do not need), production code generation where SWE-Bench Verified score is mission-critical (Qwen has not published one for this SKU yet), and anything requiring vision (the VL variant exists, this one is text-only).

For OpenAI-ecosystem teams, o4-mini or GPT-5.4 Mini probably wins on integration convenience even at higher cost. The Qwen savings are real but they require sending traffic to OpenRouter, Together, Fireworks, or a self-hosted endpoint. That switching cost matters more for small teams than the dollar savings do.

Three things to verify before committing

We do not have a SWE-Bench Verified number from Qwen. The community has been asking for one in the model's Hugging Face discussions; nothing official has shipped. If production code generation is your use case, run your own eval before swapping it in. Qwen3-Coder-Next is a different SKU built specifically for that.

OpenRouter exposes 131K context, not the native 262K. That is a provider truncation, not a model limit. If you need full 262K (or the YaRN-extended ~1M), self-hosting on vLLM or hitting Alibaba DashScope directly is the path. Most production users do not need more than 131K, but the gap exists.
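One way to act on that gap is a pre-flight check on prompt length. This is a sketch under assumptions: the exact token limits below are guesses at what "131K" and "262K" mean, the output budget is assumed to count against the context window, and the deployment labels and the helper `pick_deployment` are hypothetical.

```python
# Context limits in tokens. Exact values are assumptions: the article only says "131K" and "262K".
OPENROUTER_CONTEXT = 131_072
NATIVE_CONTEXT = 262_144

def pick_deployment(prompt_tokens, reserved_output_tokens=8_000):
    """Return which deployment can hold the prompt plus the reserved reasoning trace."""
    needed = prompt_tokens + reserved_output_tokens
    if needed <= OPENROUTER_CONTEXT:
        return "openrouter"
    if needed <= NATIVE_CONTEXT:
        return "self-hosted-or-dashscope"
    return "too-long"

print(pick_deployment(100_000))
print(pick_deployment(200_000))
```

Reserving room for the trace matters with thinking models, since a prompt that fits on its own can still overflow once several thousand reasoning tokens are generated.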

The two routes on OpenRouter ($0.0975/$0.78 versus $0.15/$1.20) point to the same model from different hosts. The cheaper route is sometimes rate-limited, and its quality can vary under load. The premium route is more consistent. Most teams want the cheap route as default and the premium one as fallback.
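The default-plus-fallback pattern can be sketched without tying it to any particular client library. Everything here is hypothetical scaffolding: `call_with_fallback`, `cheap_route`, and `premium_route` are stand-ins, and a real deployment would replace the two stubs with actual API calls and catch the client's specific rate-limit and timeout errors.

```python
def call_with_fallback(prompt, routes):
    """Try each route in order; return (route_name, result) from the first that succeeds."""
    errors = []
    for name, send in routes:
        try:
            return name, send(prompt)
        except Exception as exc:  # in real code, catch the client's specific errors instead
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all routes failed: " + "; ".join(errors))

# Hypothetical stand-ins for real client calls, so the sketch runs without network access.
def cheap_route(prompt):
    raise TimeoutError("rate limited")

def premium_route(prompt):
    return f"answer to: {prompt}"

used, answer = call_with_fallback(
    "What is 2 + 2?",
    [("cheap", cheap_route), ("premium", premium_route)],
)
print(used)  # premium, because the cheap stand-in always fails
```

Logging which route served each request makes it easy to see how often the fallback fires, which in turn tells you what your blended price really is.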
