How much does Qwen3-Next-80B-A3B-Thinking cost per million tokens?

On OpenRouter the cheapest route is $0.0975 per million input tokens and $0.78 per million output tokens. A premium-routed copy of the same model is listed at $0.15 in and $1.20 out. The model is open weights under Apache 2.0, so self-hosting is also possible.

Why is Qwen3-Next so cheap to serve?

It is an 80B-parameter MoE that activates roughly 3B parameters per token (the A3B in the name). Inference cost scales with active parameters, not total. Providers serve it at the price of a 7B model while it reasons like an 80B one. The architecture combines Gated DeltaNet attention with high-sparsity MoE routing.

Is Qwen3-Next-Thinking better than DeepSeek V4 Flash for reasoning?

Different tools. DeepSeek V4 Flash is cheaper on output ($0.28 vs $0.78) but does not produce native thinking traces. Qwen3-Next-Thinking always emits a chain-of-thought, which inflates output bills but helps on math, code, and agent loops. For non-reasoning bulk work pick V4 Flash. For AIME-style problems or multi-step planning, the trace earns its cost.

What is the context window of Qwen3-Next-Thinking?

Native context is 262,144 tokens. With YaRN extension it stretches to roughly 1,010,000 tokens, though most providers serve a truncated 131,072 window. OpenRouter exposes 131K. Self-hosters can run the full 262K natively.

ResearchApril 30, 2026·8 min read

Qwen3-Next-Thinking: the cheapest reasoning model under $1/M output

Most pricing posts in 2026 chase the GPT-5.5 vs Opus 4.7 frontier. Look one tier down and there is a stranger fact: a model that costs $0.0975 in and $0.78 out can score 87.8 on AIME25 and 77.2 on GPQA. It is open weights, it is permissively licensed, and almost nobody is writing about it.

Glowing AI chip on a dark circuit board representing sparse MoE activation

Photo by Immo Wegmann on Unsplash

What it costs versus its peers

Model	Input / 1M	Output / 1M	Native trace	License
Qwen3-Next-80B-A3B-Thinking	$0.0975	$0.78	Yes	Apache 2.0
DeepSeek V4 Flash	$0.14	$0.28	No	Open
Qwen3-VL-235B Instruct	$0.20	$0.88	No	Open
Gemini 3 Flash	$0.50	$3.00	Optional	Closed
GPT-5.4 Mini	$0.75	$4.50	Optional	Closed
Claude Haiku 4.5	$1.00	$5.00	Optional	Closed
Qwen3-Max	$0.78	$3.90	No	Closed

Pricing from OpenRouter cheap-route listings (April 30, 2026), DeepSeek API docs, official Anthropic and Google rates. "Native trace" means the model emits a chain-of-thought by default without prompting tricks.

Why this is the model nobody is writing about

Qwen3-Next-80B-A3B-Thinking shipped from Alibaba in September 2025. Eight months later the AI press cycle has moved through GPT-5.4, GPT-5.5, Claude Opus 4.7, DeepSeek V4, and three Gemini revisions. None of those are open weights at this price tier. None of them produce native reasoning traces below $1 per million output tokens.

The benchmark scores got buried because Qwen released them alongside Qwen3-Max and Qwen3-VL the same week. Coverage focused on the flagship 235B variant. The 80B Thinking SKU is the better deal.

We pulled the numbers because a reader asked us why our pricing table did not include it. Once we ran the cost-per-task math, the answer was simple: nothing else is close on $/correct-AIME-answer. Below $1 output and above 85 on AIME25 is a combination only this model offers right now.

How 80B parameters get priced like 7B

The architecture matters here, not as trivia, but because it explains the price tag. Qwen3-Next is a high-sparsity Mixture of Experts. Total parameters: 80 billion. Active parameters per token: roughly 3 billion. The router picks 3B worth of experts for each token, the rest sit idle.

Serving cost on a GPU scales with active parameters and KV cache, not parameter count. A 3B-active model fits on a single H100 with room for batching. Throughput approaches what you would see from a Llama 3.1 8B deployment. The provider that fronts it on OpenRouter passes that math through to your bill.

The catch is quality. MoE routing is hit or miss; some queries activate the wrong experts and the answer reflects that. Qwen tunes this with a hybrid Gated DeltaNet plus Gated Attention setup that NVIDIA wrote up in detail in their blog about the architecture. The published benchmarks suggest the routing works well enough on math, coding, and tool use to compete with dense 30-32B models in the same family.

Net effect: the sparsity dividend lands on your invoice. You pay 7B-tier serving costs for 80B-tier output, on workloads where the router cooperates. That is a real arbitrage if you set up the right evals before committing.

Cost per AIME-style problem, with realistic trace lengths

Reasoning model pricing is misleading until you account for the trace. Thinking-style models emit long chains of internal reasoning before producing the final answer. Qwen warns explicitly on the model card that the trace is longer than its 30B predecessor. Real-world output token counts for a single AIME problem land between 4,000 and 10,000 tokens depending on difficulty.

Here is what a single hard problem actually costs. Assume 1,500 input tokens (problem statement plus a short system prompt) and 8,000 output tokens (typical for an AIME walk-through with answer):

Model	Cost / problem	AIME25 score	Notes
Qwen3-Next-Thinking	$0.0064	87.8	$0.0975 in / $0.78 out, 8K trace
DeepSeek V4 Flash	$0.0024	Not native reasoner	Cheap but no thinking trace
Gemini 3 Flash (with thinking)	$0.0247	82.4	$0.50 in / $3.00 out
Claude Haiku 4.5 (extended thinking)	$0.0415	78.1	$1.00 in / $5.00 out
GPT-5.4 Mini (medium reasoning)	$0.0371	84.6	$0.75 in / $4.50 out
GPT-5.5 (standard)	$0.2475	94.3	$5.00 in / $30.00 out

Per-problem cost assumes 1,500 input tokens and 8,000 output tokens. AIME25 scores from each provider's published model card. Smaller reasoning models often hit shorter traces, so real bills can come in under these estimates.

Read the table this way: at 87.8 on AIME25, Qwen3-Next-Thinking lands within 7 points of GPT-5.5 while costing 38x less per problem. Against Haiku 4.5 it is 6x cheaper and scores 9.7 points higher. Gemini 3 Flash with thinking enabled is the closest peer on price and underperforms by 5.4 points.

DeepSeek V4 Flash is included to make the point honestly. It is cheaper per token but it is not a thinking model. If your task does not need a trace, V4 Flash beats this on cost. If your task benefits from one, V4 Flash is the wrong tool.

Where this model fits and where it does not

The right framing is not "Qwen3-Next-Thinking versus everything." It is which workloads cost less here than anywhere else, and which ones you should run elsewhere.

Pick this model for math evals that benefit from native CoT, agent loops where you actually want to see the reasoning trace for debugging, code review or planning tasks where the cost per call sits above $0.05 on closed-source reasoners, and any open-weight requirement (data residency, self-hosting, fine-tuning). The 262K native context covers most long-document reasoning use cases without YaRN tricks.

Skip it for high-volume non-reasoning bulk work (DeepSeek V4 Flash is cheaper and does not waste tokens on traces you do not need), production code generation where SWE-Bench Verified score is mission-critical (Qwen has not published one for this SKU yet), and anything requiring vision (the VL variant exists, this one is text-only).

For OpenAI-ecosystem teams, o4-mini or GPT-5.4 Mini probably wins on integration convenience even at higher cost. The Qwen savings are real but they require sending traffic to OpenRouter, Together, Fireworks, or a self-hosted endpoint. That switching cost matters more for small teams than the dollar savings do.

Three things to verify before committing

We do not have a SWE-Bench Verified number from Qwen. The community has been asking for one in the model's Hugging Face discussions; nothing official has shipped. If production code generation is your use case, run your own eval before swapping in. Qwen3-Coder-Next is a different SKU built specifically for that.

OpenRouter exposes 131K context, not the native 262K. That is a provider truncation, not a model limit. If you need full 262K (or the YaRN-extended ~1M), self-hosting on vLLM or hitting Alibaba DashScope directly is the path. Most production users do not need more than 131K, but the gap exists.

The two routes on OpenRouter ($0.0975/$0.78 versus $0.15/$1.20) point to the same model from different hosts. The cheaper route is sometimes rate-limited or quality- varies under load. The premium route is more consistent. Most teams want the cheap route as default and the premium one as fallback.

Sources

- Qwen3-Next-80B-A3B-Thinking model card on Hugging Face: huggingface.co
- OpenRouter pricing and routes: openrouter.ai
- NVIDIA Technical Blog on Qwen3-Next hybrid MoE architecture: developer.nvidia.com
- Qwen3-Next launch announcement on Qwen blog: qwen.ai
- Alibaba Cloud Model Studio pricing reference: alibabacloud.com
- DeepSeek V4 official pricing: api-docs.deepseek.com
- Anthropic Claude pricing reference: platform.claude.com
- llm-stats Qwen3-Next-Thinking benchmark page: llm-stats.com

Compare all model pricing Calculate your API costs