Qwen3 Coder Next is still $0.11 per million input. Kimi K2.6 costs about 7x more, GPT-5.5 costs 45x more, and the benchmark gaps do not justify either.
Qwen3 Coder Next shipped on February 3. Kimi K2.6 followed on April 20, and GPT-5.5 arrived on April 23. Three months later, the price chart is the most lopsided we have seen in any coding-model tier. Qwen3 Coder Next on OpenRouter bills $0.11 input and $0.80 output per million tokens. Kimi K2.6 on OpenRouter bills $0.74 and $3.50. GPT-5.5 standard bills $5 and $30. On SWE-Bench Verified they score 70.6 and 80.2; GPT-5.5 did not publish a score. Below is the routing math, the benchmark gap by task, and the workloads where the cheap one is now the obvious answer.

The price chart, with every provider that matters
Qwen3 Coder Next is the rare model where provider variation is bigger than the gap between some competitor pairs. OpenRouter routes the cheapest endpoint at $0.11 input. Alibaba DashScope direct is roughly 3x that. Bedrock, Hugging Face, and Vercel are roughly 5x. Pick the provider that matches your latency and uptime needs, but pick consciously.
| Model / Provider | Input / 1M | Output / 1M | Cache read | Context |
|---|---|---|---|---|
| Qwen3 Coder Next (OpenRouter) | $0.11 | $0.80 | $0.07 | 256K |
| Qwen3 Coder Next (Alibaba) | $0.30 | $1.50 | n/a | 256K |
| Qwen3 Coder Next (Bedrock / HF / Vercel) | $0.50 | $1.20 | n/a | 256K |
| Kimi K2.6 (OpenRouter) | $0.74 | $3.50 | $0.25 | 256K |
| Kimi K2.6 (Moonshot direct) | $0.60 | $2.50 | not published | 256K |
| GPT-5.5 (standard) | $5.00 | $30.00 | not published | 1M |
| GPT-5.5 (batch) | $2.50 | $15.00 | not published | 1M |
On the cheapest tier for each, Kimi K2.6 costs about 6.7x what Qwen does on input and 4.4x on output. GPT-5.5 input costs are 45x Qwen's and output is 38x. GPT-5.5 batch trims those multipliers roughly in half but still leaves both sides an order of magnitude above Qwen. The OpenRouter cache tier for Qwen3 Coder Next is the unsung hero here: $0.07 per million on cached input means a long-running coding agent with a stable system prompt rounds the input bill to almost zero. The Artificial Analysis price chart for Qwen3 Coder Next tracks the same delta across its provider sample.
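The multipliers are easy to reproduce. Here is a minimal Python sketch using the cheapest-tier prices from the table; the dict layout and model keys are ours for illustration, not any provider's API:

```python
# Cheapest-tier prices from the table above, USD per 1M tokens.
PRICES = {
    "qwen3-coder-next": {"in": 0.11, "out": 0.80, "cache": 0.07},
    "kimi-k2.6":        {"in": 0.74, "out": 3.50, "cache": 0.25},
    "gpt-5.5":          {"in": 5.00, "out": 30.00, "cache": None},
}

base = PRICES["qwen3-coder-next"]
for name, p in PRICES.items():
    print(f"{name}: {p['in'] / base['in']:.1f}x input, "
          f"{p['out'] / base['out']:.1f}x output vs Qwen")
# kimi-k2.6: 6.7x input, 4.4x output
# gpt-5.5: 45.5x input, 37.5x output

# Cache tier: re-reading a 100K-token stable system prompt costs
# 100K / 1M * $0.07 per turn on Qwen's cached-input price.
print(round(100_000 / 1_000_000 * base["cache"], 4))  # 0.007
```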
Benchmark scores, with the omissions called out
Every model in this comparison cherry-picks which benchmarks to publish. Qwen published the most; GPT-5.5 published the fewest. Reading the scores as like-for-like would be a mistake: the cells marked "not published" are a real gap, not a missing reference.
| Benchmark | Qwen3 Coder Next | Kimi K2.6 | GPT-5.5 |
|---|---|---|---|
| SWE-Bench Verified | 70.6 | 80.2 | not published |
| SWE-Bench Pro | 44.3 | 58.6 | 58.6 |
| SWE-Bench Multilingual | 62.8 | 76.7 | not published |
| Terminal-Bench 2.0 | 36.2 | 66.7 | 82.7 |
| LiveCodeBench v6 | 58.9 | 89.6 | not published |
| Codeforces (rating) | 2100 | not published | not published |
| AA Intelligence Index | ~46 | 54 | 60 |
SWE-Bench Pro is the one benchmark all three report. Kimi K2.6 and GPT-5.5 tie at 58.6. Qwen sits 14.3 points behind at 44.3. That is a real quality gap on the hardest, most repo-realistic agentic coding benchmark in public circulation. For tasks that fail open (the patch does not apply, the test still fails), the gap shows up as retries; for tasks that fail silently (incorrect patch accepted), it shows up as bugs. Choose accordingly.
On SWE-Bench Verified, Kimi K2.6 leads cleanly at 80.2 versus Qwen at 70.6. That 9.6-point gap is the routing cost of going cheap on repo-level work. GPT-5.5 did not publish a Verified score, which is conspicuous; OpenAI led with SWE-Bench Pro (where it ties Kimi) and Terminal-Bench 2.0 (where it leads everything). For comparison, Claude Opus 4.7 reports 64.3 on SWE-Bench Pro, which currently tops the public leaderboard.
Terminal-Bench 2.0 is the most punishing column for Qwen. 36.2 versus 82.7 is a 46.5-point gap, and Terminal-Bench is the benchmark that maps closest to autonomous shell-driven agent work (Codex, OpenHands, terminal-tool scaffolds). If your routing target is "run an unattended agent against my machine for an hour," Qwen3 Coder Next is the wrong choice regardless of price.
Cost per task, four real workload shapes
Pricing per million tokens is a starting point, not a unit your bill arrives in. Workload shape decides everything. Here are four shapes coding teams actually run, with the cheapest tier for each model.
| Workload | Qwen (OR) | Qwen (DashScope) | Kimi K2.6 (OR) | GPT-5.5 | GPT-5.5 batch |
|---|---|---|---|---|---|
| Quick edit (10K in / 2K out) | $0.0027 | $0.006 | $0.014 | $0.110 | $0.055 |
| Agentic turn (50K in / 10K out) | $0.0135 | $0.030 | $0.072 | $0.550 | $0.275 |
| Big refactor (200K in / 30K out) | $0.046 | $0.105 | $0.253 | $1.90 | $0.95 |
| 1B tokens / month (70/30 blend) | $317 | $660 | $1,568 | $12,500 | $6,250 |
At the 1B-token monthly mark, the gap between Qwen on OpenRouter and GPT-5.5 standard is $12,183. Annualised, that is $146,196 for a single workload. For a team that runs 5 such workloads in parallel, the savings reach three quarters of a million dollars a year before any retries are accounted for. The catch: Qwen has to clear the quality bar without too many retries, or the savings get clawed back. With Qwen at 70.6 SWE-Bench Verified and GPT-5.5 at an undisclosed-but-comparable number, one extra retry per ten tasks (a 10% retry rate) adds less than $35 per 1B tokens. The math holds up.
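A minimal cost model reproduces the table rows and the retry adjustment. The token counts, the 70/30 blend, and the retry knob are this article's assumptions, not any provider's billing API:

```python
def task_cost(tokens_in: int, tokens_out: int,
              price_in: float, price_out: float) -> float:
    """USD for one task; prices are USD per 1M tokens."""
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

QWEN_OR = (0.11, 0.80)   # OpenRouter, cheapest tier
GPT_55 = (5.00, 30.00)   # standard

print(round(task_cost(10_000, 2_000, *QWEN_OR), 4))   # 0.0027 quick edit
print(round(task_cost(200_000, 30_000, *GPT_55), 2))  # 1.9    big refactor

def monthly_cost(price_in: float, price_out: float,
                 retry_rate: float = 0.0, blend: float = 0.7) -> float:
    """1B tokens/month at a blend of input:output, inflated by retries."""
    base = 1_000 * (blend * price_in + (1 - blend) * price_out)
    return base * (1 + retry_rate)

print(round(monthly_cost(*QWEN_OR), 2))                 # 317.0
print(round(monthly_cost(*QWEN_OR, retry_rate=0.1), 2)) # 348.7 -> ~$31.70 of retries
print(round(monthly_cost(*GPT_55) - monthly_cost(*QWEN_OR), 2))  # 12183.0 gap
```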
Where Qwen3 Coder Next actually wins, and where it does not
The clean wins for Qwen3 Coder Next, ranked by how often they show up in real engineering work:
- Cached coding agents. If your agent uses a stable system prompt over a long session, the OpenRouter cached-input tier at $0.07 per million makes the input bill effectively zero; output stays at $0.80. A 100K-input, 30K-output turn with a full cache hit bills $0.031. The same turn on GPT-5.5 bills $1.40. (A call sketch follows this list.)
- High-volume function-level codegen. EvalPlus at 86.6 and MultiPL-E at 88.2 say Qwen is perfectly fine at HumanEval-tier work: individual functions, snippets, and tests. For backend boilerplate and CRUD scaffolding, the quality gap to Kimi and GPT-5.5 is too small to justify the price gap.
- Self-hosted inference. 80B total / 3B active fits on a single H100 at Q4_K_M (52 GB). Ollama already ships qwen3-coder-next. For teams that cannot send code to a third-party API for legal reasons, this is the only model in this comparison that runs on hardware you already own. Kimi K2.6 at 1T parameters needs multi-node serving infra.
- Synthetic-data generation. Generating training pairs, test fixtures, or fuzzing inputs at 80 to 120 tokens per second per H100 means you can saturate a corpus pipeline at a fraction of the API cost. Apache 2.0 means the outputs are unencumbered.
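For the cached-agent case, here is a minimal call sketch against OpenRouter's OpenAI-compatible endpoint, using the model slug from the listing cited in Sources. The env var name, prompt text, and parse_config() are illustrative, and whether a request actually lands on the cached-input tier depends on the underlying provider's prompt caching:

```python
import os
from openai import OpenAI

# OpenRouter speaks the OpenAI chat-completions protocol.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # illustrative env var name
)

# Keep the system prompt byte-identical across turns so the provider
# can serve it from cache at the $0.07/M cached-input tier.
SYSTEM_PROMPT = "You are a coding agent for repo X. Conventions: ..."

resp = client.chat.completions.create(
    model="qwen/qwen3-coder-next",  # slug from the OpenRouter listing
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Write a unit test for parse_config()."},
    ],
)
print(resp.choices[0].message.content)
```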
The places to keep routing to GPT-5.5 or Kimi K2.6:
- Terminal-driven agents. If the agent is responsible for shell tool calls, environment setup, and long-horizon terminal sessions, the 46-point Terminal-Bench 2.0 gap is going to bite. Route to GPT-5.5 standard (or Codex Fast) and pay the price premium.
- Multilingual codebases. Kimi K2.6 leads SWE-Bench Multilingual at 76.7 versus Qwen at 62.8. For repos with significant non-English-comment density or non-ASCII identifiers, the 13.9-point gap shows up as failed patches.
- LiveCodeBench-style competitive coding. Kimi K2.6 at 89.6 versus Qwen at 58.9 is a 30.7-point gap on a benchmark that maps to algorithmic problem solving, code-golf, and competitive coding. Qwen does fine at a Codeforces 2100 rating, but pure correctness on harder benchmarks is a Kimi advantage.
- Whole-repo context past 256K. Qwen3 Coder Next is 256K native, extendable to 1M on supporting providers. GPT-5.5 ships at 1M out of the box. For agents that need to load a 500K-token codebase into context every turn, GPT-5.5 is operationally simpler.
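Folded into one place, the routing rules above reduce to a few branches. This is a toy sketch; the task attributes and thresholds are our illustrative encoding of the two lists, not a production policy:

```python
def route(task: dict) -> str:
    """Toy router encoding the rules of thumb above."""
    if task.get("terminal_agent"):               # 46.5-pt Terminal-Bench gap
        return "gpt-5.5"
    if task.get("context_tokens", 0) > 256_000:  # past Qwen's native window
        return "gpt-5.5"
    if task.get("multilingual"):                 # SWE-Bench Multilingual gap
        return "kimi-k2.6"
    if task.get("competitive"):                  # LiveCodeBench gap
        return "kimi-k2.6"
    return "qwen3-coder-next"                    # default: cheapest tier wins

print(route({"terminal_agent": True}))      # gpt-5.5
print(route({"context_tokens": 500_000}))   # gpt-5.5
print(route({"multilingual": True}))        # kimi-k2.6
print(route({}))                            # qwen3-coder-next
```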
One more constraint that matters and rarely gets flagged: Qwen3 Coder Next is non-thinking only. No visible chain-of-thought blocks, no exposed reasoning traces. For most production agents this is a feature (faster TTFT, lower output token count, cleaner outputs). For evals where you score partial credit on reasoning quality, or for use cases that benefit from chain-of-thought transparency, Kimi K2.6 and GPT-5.5 are better matches.
The three-month-old model that did not lose much ground
Qwen3 Coder Next shipped on February 3, 2026. By coding-model standards, that is geologically old. Kimi K2.6 and GPT-5.5 both arrived in late April with higher numbers on most benchmarks. The reasonable expectation was that Qwen would be obsolete by May. What actually happened is that the new entrants slotted in above Qwen on quality but at roughly 4x to 45x the price, and Qwen kept its value tier essentially uncontested.
The only model that genuinely threatens Qwen3 Coder Next in the cheap tier is DeepSeek V4-Flash at $0.07 input and $0.42 output (see our DeepSeek V4-Flash budget tier teardown), but V4-Flash is general-purpose, not coding-specialised, and trails on SWE-Bench Verified by a margin that varies with scaffold. For dedicated coding workloads at the value tier, Qwen3 Coder Next is still the clear pick.
Looking at the wider routing landscape we covered in our DeepSeek V4 vs GPT-5.5 vs Claude Opus 4.7 piece, the pattern from this last quarter is consistent: frontier models keep arriving at top-tier prices, and value-tier models keep capturing more of the actual workload. Qwen3 Coder Next is the cleanest version of that pattern in the coding-specialised category.
Sources
- Qwen: Qwen3 Coder Next official blog - Architecture, parameter count, benchmark suite
- Qwen3 Coder Next technical report (arXiv 2603.00729) - SWE-Bench Verified 70.6, SWE-Bench Pro 44.3, Codeforces 2100
- OpenRouter: qwen/qwen3-coder-next - $0.11/$0.80 per 1M, $0.07 cache read, 256K context
- CloudPrice: Alibaba Qwen3 Coder Next provider matrix - DashScope $0.30/$1.50, Bedrock / HF / Vercel $0.50/$1.20
- Moonshot: Kimi K2.6 official blog - SWE-Bench Verified 80.2, LiveCodeBench v6 89.6, AA Intelligence Index 54
- OpenRouter: moonshotai/kimi-k2.6 - $0.74/$3.50 per 1M, 256K context
- OpenAI: Introducing GPT-5.5 - Standard $5/$30, batch $2.50/$15, Terminal-Bench 2.0 82.7, SWE-Bench Pro 58.6
- Artificial Analysis: Kimi K2.6 - Third-party intelligence index, blended pricing