Mercury 2 outputs at 788 tokens per second for $0.75 per million. The diffusion math turns frontier reasoning pricing into a rounding error.
Inception Labs' Mercury 2 is the first diffusion-based reasoning LLM, and it ships at $0.25 input / $0.75 output per million tokens. Artificial Analysis measures it at 788 tokens per second, ranked first across 156 models surveyed. GPT-5.5 manages 72. Claude Opus 4.7 manages 48. The same workload that takes Opus almost six hours to finish takes Mercury 2 about 21 minutes, and the bill is 33 times smaller. The benchmarks tell a more complicated story, but the wall-clock math is the part that rewrites the routing table.

Most pricing comparisons rank models on dollars per million tokens and stop there. Mercury 2 breaks that frame because it sits in two separate budget tiers at the same time. On output dollars, it is in the Gemini Flash-Lite class. On wall-clock latency, it is in a tier no autoregressive model has reached. The combination changes which workloads make economic sense to route to a reasoning model in the first place. Below is the full pricing, speed, benchmark, and workload math, with caveats about where the diffusion approach still leaves quality on the table.
The price-and-speed table that does not fit on a single axis
Every reasoning launch is now graded on two numbers: cost per million output tokens and tokens per second. Mercury 2 is the first model where both are class-leading at once. The headline figures, sourced from Inception Labs and verified against Artificial Analysis's third-party measurements:
| Model | Input / 1M | Output / 1M | Output speed | Context |
|---|---|---|---|---|
| Mercury 2 | $0.25 | $0.75 | 788 tok/s | 128K |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | ~250 tok/s | 1M |
| Grok 4 Fast | $0.20 | $0.50 | ~180 tok/s | 2M |
| DeepSeek V4-Pro (promo) | $0.435 | $0.87 | 37 tok/s | 1M |
| Gemini 3.1 Pro | $2.00 | $12.00 | 123 tok/s | 2M |
| Claude Opus 4.7 | $5.00 | $25.00 | 48 tok/s | 200K |
| GPT-5.5 | $5.00 | $30.00 | 72 tok/s | 272K |
Note the speed column. Mercury 2's 788 tok/s is the Artificial Analysis measured figure; Inception's own claim is 1,009 tok/s on Blackwell, with peak headline figures up to 1,196. Whichever number is used, the gap to the next-fastest frontier reasoning model in the table, Gemini 3.1 Pro at 123 tok/s, is at least 6x. Cached input drops Mercury 2 to $0.025/M, which makes prompt-cached pipelines effectively free on the input side.
What diffusion actually changes about the bill
Autoregressive decoding is sequential by construction: token N waits for token N-1, which waits for the KV cache lookup, which waits for memory bandwidth. The per-request throughput ceiling on a single H100 is roughly 100 to 200 tokens per second for a 70B-class model. Diffusion language models invert this: a fixed-length output is initialised as noise (or masked tokens) and refined in parallel across 8 to 20 denoising steps. The number of forward passes is roughly constant regardless of output length, and there is no KV cache to maintain.
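A toy sketch of the two decoding loops makes the forward-pass arithmetic concrete. The step count of 16 is an assumed value inside the 8-to-20 range above, and the model calls are placeholders, not Inception's API:

```python
# Toy comparison of forward-pass counts, not a real inference engine.
# The model calls are placeholders; the step count is an assumption.

def autoregressive_passes(out_len: int) -> int:
    """Sequential decoding: one forward pass per output token."""
    passes = 0
    for _ in range(out_len):   # token N cannot start until token N-1 exists
        passes += 1            # model.forward(context) would run here
    return passes

def diffusion_passes(out_len: int, steps: int = 16) -> int:
    """Parallel denoising: the whole sequence is refined a fixed number of times."""
    passes = 0
    for _ in range(steps):     # model.denoise(full_sequence) would run here
        passes += 1            # each step touches all out_len positions at once
    return passes

for n in (1_000, 10_000, 50_000):
    print(f"{n:>6} tokens: {autoregressive_passes(n):>6} AR passes vs {diffusion_passes(n)} denoising steps")
```

The per-step cost of the diffusion model still grows with output length, but the number of round trips does not, which is where the latency win comes from.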
This parallel-denoising design changes three things about the unit economics:
- Throughput per GPU is 5-10x higher. A single Blackwell node serving Mercury 2 at 1,000 tok/s replaces five to ten nodes serving Opus 4.7 at 100 tok/s. Capex per token falls accordingly, and Inception passes the savings through (the sketch after this list runs the node arithmetic).
- Long outputs do not slow down linearly. On autoregressive models, a 50K-token output takes roughly 50x longer than a 1K output. On Mercury 2, the denoising-step count grows sub-linearly with length, so a 50K answer finishes in about a minute instead of the quarter-hour a frontier autoregressive model needs. Long-form drafting becomes interactive instead of batched.
- Memory pressure does not scale with context. KV cache is the dominant memory cost on autoregressive inference past 100K tokens. Mercury 2 sidesteps that, which is why 128K context fits on a single GPU at hundreds of concurrent requests.
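For the first bullet, the capacity arithmetic is one division. The 10,000 tok/s aggregate demand below is an assumption for illustration; the per-node speeds are the figures quoted above:

```python
import math

# Nodes needed to sustain an assumed aggregate demand, given per-node throughput.
def nodes_needed(demand_tok_s: float, per_node_tok_s: float) -> int:
    return math.ceil(demand_tok_s / per_node_tok_s)

demand = 10_000  # assumed aggregate tok/s a service has to sustain
print(nodes_needed(demand, 1_000))   # 10 Blackwell nodes serving Mercury 2
print(nodes_needed(demand, 100))     # 100 nodes serving an autoregressive model at ~100 tok/s
```

Capex per token tracks the node count, so the 5-10x figure in the bullet falls straight out of the speed ratio.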
The catch: diffusion is newer for reasoning workloads, and per-token output quality still lags top autoregressive models on harder benchmarks. The gap shows up clearly on the next table.
The benchmarks Inception published, and the ones missing from the table
Inception's launch material publishes seven benchmark scores plus the Artificial Analysis intelligence index. Several headline benchmarks are absent: SWE-Bench Verified, MATH 500, and full MMLU. That is worth flagging because those are the benchmarks most production teams use to qualify a reasoning model. Scores not published for the comparison models are marked n/p below.
| Benchmark | Mercury 2 | GPT-5.5 | Opus 4.7 |
|---|---|---|---|
| AIME 2025 | 91.1 | 94.6 | 92.4 |
| GPQA Diamond | 73.6 | 93.6 | 94.2 |
| HumanEval | 89.4 | 96.1 | 95.4 |
| LiveCodeBench | 67.3 | ~78 | ~75 |
| IFBench | 71.3 | n/p | n/p |
| Tau2 | 52.9 | n/p | n/p |
| SciCode | 38.4 | n/p | n/p |
| AA Intelligence Index | 33 (rank 35) | 72 | 69 |
AIME 2025 at 91.1 is genuinely strong - within 1.3 points of Opus 4.7 and within striking distance of GPT-5.5. The model can do hard math. GPQA Diamond at 73.6 is twenty points behind the leaders, which is where the diffusion-vs-autoregressive quality gap shows up on graduate-level scientific reasoning. HumanEval at 89.4 is solid for single-function code, but LiveCodeBench at 67.3 hints that multi-file and repo-scale coding will struggle.
The Artificial Analysis intelligence index lands Mercury 2 at rank 35, which is the cleanest single-number summary of where it sits. It is not in the same accuracy tier as GPT-5.5 or Opus 4.7. It is in the same accuracy tier as last year's frontier models, at a fraction of the price and a multiple of the speed.
Wall-clock economics for a million output tokens
Take 1,000,000 output tokens and divide by output speed to get wall-clock time. Multiply by output dollars per million for total cost. Taken together, those two numbers - dollars and minutes - are what actually move user experience and infrastructure bills on long-output workloads.
| Model | Time for 1M output | Output cost | vs Mercury 2 |
|---|---|---|---|
| Mercury 2 | 21 min | $0.75 | baseline |
| DeepSeek V4-Pro (promo) | 7.5 hours | $0.87 | 21x slower, ~same cost |
| Gemini 3.1 Flash-Lite | ~67 min | $1.50 | 3x slower, 2x cost |
| Gemini 3.1 Pro | 2.3 hours | $12.00 | 6.4x slower, 16x cost |
| GPT-5.5 | 3.9 hours | $30.00 | 11x slower, 40x cost |
| Claude Opus 4.7 | 5.8 hours | $25.00 | 16x slower, 33x cost |
The DeepSeek row is the interesting one: same output cost as Mercury 2 (within 16 percent), but 21x slower wall-clock. If you are running synchronous user-facing workloads, the cost gap is not the deciding factor; the latency gap is. For background-batch workloads where latency is free, V4-Pro stays competitive on price alone. For everything user-facing, Mercury 2's speed turns into actual conversion and engagement deltas.
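The table is two divisions per row. A minimal sketch that reproduces it from the listed output prices and the measured (or approximate) speeds in the first table:

```python
# Reproduce the wall-clock table from output price ($/1M tokens) and speed (tok/s).
models = {
    "Mercury 2":             (0.75, 788),
    "DeepSeek V4-Pro promo": (0.87, 37),
    "Gemini 3.1 Flash-Lite": (1.50, 250),
    "Gemini 3.1 Pro":        (12.00, 123),
    "GPT-5.5":               (30.00, 72),
    "Claude Opus 4.7":       (25.00, 48),
}

TOKENS = 1_000_000
base_price, base_speed = models["Mercury 2"]

for name, (price, speed) in models.items():
    hours = TOKENS / speed / 3600          # wall-clock time for 1M output tokens
    print(f"{name:24s} {hours:5.2f} h  ${price:6.2f}  "
          f"{base_speed / speed:4.1f}x slower  {price / base_price:4.1f}x cost")
```

The Mercury 2 row prints 1.0x on both ratios by construction; the rest reproduces the table to rounding.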
Four workload shapes, four bills
The same shapes used in our GPT-5.5 vs Opus 4.7 vs Gemini 3.1 Pro comparison, adjusted for Mercury 2's 128K context cap. The heavier refactor scenario caps input at 100K so the comparison stays apples-to-apples.
| Workload | Mercury 2 | Flash-Lite | V4-Pro promo | Opus 4.7 | GPT-5.5 |
|---|---|---|---|---|---|
| Chat reply (20K in / 5K out) | $0.009 | $0.013 | $0.013 | $0.225 | $0.250 |
| Long answer (5K in / 30K out) | $0.024 | $0.046 | $0.028 | $0.775 | $0.925 |
| Mid refactor (100K in / 30K out, capped) | $0.048 | $0.070 | $0.069 | $1.250 | $1.400 |
| 1B tokens / month (70/30 blend) | $400 | $625 | $566 | $11,000 | $12,500 |
At 1 billion blended tokens monthly, Mercury 2 saves roughly $10,600 every 30 days against Opus 4.7 and $12,100 against GPT-5.5. Annualised, that is between $127K and $145K per single-tenant workload. The catch is that Mercury 2 has to clear an accuracy bar that varies by task: for chat replies, summarisation, and structured extraction the bar is low and the savings stick. For agentic coding and graduate-science reasoning, the GPQA and LiveCodeBench gaps mean a meaningful chunk of those savings gets clawed back by failures.
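Each bill is a weighted sum of the two list prices. A minimal sketch of the calculator behind the table, using the per-million prices from the first table with no caching discounts or verbosity adjustment:

```python
# Per-request and per-month bills from list prices alone.
# Prices are $ per 1M tokens as (input, output).
PRICES = {
    "Mercury 2":  (0.25, 0.75),
    "Flash-Lite": (0.25, 1.50),
    "V4-Pro":     (0.435, 0.87),
    "Opus 4.7":   (5.00, 25.00),
    "GPT-5.5":    (5.00, 30.00),
}

def bill(model: str, tokens_in: int, tokens_out: int) -> float:
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

# Mid refactor: 100K in / 30K out
for name in PRICES:
    print(f"{name:12s} ${bill(name, 100_000, 30_000):.3f} per request")

# 1B blended tokens per month at a 70/30 input/output split
for name in PRICES:
    print(f"{name:12s} ${bill(name, 700_000_000, 300_000_000):,.0f} per month")
```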
Where the headline win gets trimmed
The pricing and speed numbers are accurate. The win is real. A handful of structural issues show up only when you read the Artificial Analysis report footnotes or run Mercury 2 against your own evals, and they matter.
Verbosity is the most consequential. Artificial Analysis measured 69 million tokens generated across their evaluation suite versus a 26 million median for other models in the same class - roughly 160 percent more output tokens per task. That erodes a chunk of the per-million-token savings once you bill the actual workload rather than the listed unit price. In the worst case, effective per-task output cost lands closer to $2 per million than the headline $0.75. Cached inputs do not help, because the verbosity is on the output side.
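The effective-price arithmetic, using the Artificial Analysis token counts quoted above:

```python
# Verbosity-adjusted effective output price: the listed $/M scaled by how many
# more tokens the model emits for the same tasks (Artificial Analysis figures).
listed_price = 0.75              # $ per 1M output tokens
mercury_tokens = 69_000_000      # tokens generated across the eval suite
class_median = 26_000_000        # median for models in the same class

verbosity_multiplier = mercury_tokens / class_median   # ~2.65x
effective_price = listed_price * verbosity_multiplier  # ~$1.99 per 1M task-equivalent tokens
print(f"{verbosity_multiplier:.2f}x verbosity -> ${effective_price:.2f}/M effective")
```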
Context is the next gap. At 128K, Mercury 2 sits a tier below what long-document RAG needs. Gemini 3.1 Pro offers 2M context (with a tiered price step at 200K). DeepSeek V4-Pro offers 1M context flat. Mercury 2 at 128K also trails GPT-5.5 (272K) and Opus 4.7 (200K), and it cannot fit a full codebase the way Gemini can. The 50K max output is generous and helps recover ground for long-form drafting, though it does not undo the input ceiling.
Then there is the SWE-Bench omission. Inception's benchmark suite skips it entirely. The closest proxy in their material is LiveCodeBench at 67.3, which suggests SWE-Bench would land somewhere around 55 to 65 percent if measured. That puts Mercury 2 below Hunyuan HY3 (74.4), well below DeepSeek V4-Pro (83.7), and far below GPT-5.5 (88.7) on the benchmark agentic coding tools care most about. For repo-scale coding, route to a specialised model. Mercury 2's niche is not autonomous code-writing; it is high-throughput reasoning where speed and unit cost both matter.
So what should actually move
High-throughput batch jobs, real-time chat replies, draft generation, summarisation, classification, structured-output extraction, and AIME-style math problems are the workloads where Mercury 2 becomes the new default. The combination of 788 tok/s and $0.75 per million output is unmatched for anything that needs to feel snappy and bill cheap. Migrating a customer support assistant from Opus 4.7 to Mercury 2 typically cuts the bill by about 95 percent and the user-perceived response time by roughly 90 percent at the same time. The migration cost is one round of evals and a prompt tweak to handle Mercury's verbosity.
Leave alone: anything that depends on SWE-Bench-grade autonomous coding (route to GPT-5.5 or Opus 4.7), anything that regularly needs more than 128K context (route to Gemini 3.1 Pro), and anything that demands the absolute top GPQA Diamond accuracy for graduate-level research or certain regulatory analysis. For those workloads, Mercury 2 still earns a spot as a cheap front-end, taking on first-pass extraction or query decomposition before the slower, more accurate model handles the final step. The hybrid stack typically lands at under a tenth of the cost of running Opus 4.7 end-to-end with no meaningful quality regression.
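Put together, the routing rule is short. A hypothetical sketch: the request fields and the 128K threshold are illustrative assumptions, and the model identifiers are just the names used in this article, not real endpoint strings:

```python
# Hypothetical router encoding the split described above; thresholds and
# request fields are assumptions for illustration, not vendor guidance.
from dataclasses import dataclass

@dataclass
class Request:
    context_tokens: int
    needs_repo_scale_coding: bool   # SWE-Bench-grade autonomous coding
    needs_top_gpqa_accuracy: bool   # graduate-level science / regulatory analysis

def route(req: Request) -> str:
    if req.context_tokens > 128_000:
        return "gemini-3.1-pro"              # beyond Mercury 2's context ceiling
    if req.needs_repo_scale_coding:
        return "gpt-5.5"                     # or claude-opus-4.7
    if req.needs_top_gpqa_accuracy:
        # Mercury 2 as the cheap front-end (extraction, query decomposition),
        # a slower frontier model for the final answer.
        return "mercury-2 -> claude-opus-4.7"
    return "mercury-2"                       # high-throughput, low-cost default

print(route(Request(20_000, False, False)))  # mercury-2
```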
Sources
- Inception Labs: Introducing Mercury 2 - Official pricing ($0.25/$0.75 per 1M), 1,009 tok/s on Blackwell, benchmark scores
- Artificial Analysis: Mercury 2 - Third-party measured 788 tok/s, intelligence index 33, verbosity footnote
- OpenRouter: inception/mercury-2 - Routing endpoint listing, 128K context confirmation
- BusinessWire: Mercury 2 launch press release - Full benchmark table, 5x speed claim vs Claude 4.5 Haiku and GPT-5 Mini
- The Neuron Daily: diffusion models coming for text - Peak 1,196 tok/s headline, industry framing
- Anthropic: Claude pricing - Opus 4.7 at $5/$25 per 1M for the speed/cost comparison rows
- OpenAI: API pricing - GPT-5.5 at $5/$30 per 1M, used for the 40x output-cost comparison