Mercury 2 outputs at 788 tokens per second for $0.75 per million. The diffusion math turns frontier reasoning pricing into a rounding error.
Inception Labs' Mercury 2 is the first diffusion-based reasoning LLM, and it ships at $0.25 input / $0.75 output per million tokens. Artificial Analysis measures it at 788 tokens per second, ranked first across 156 models surveyed. GPT-5.5 manages 72. Claude Opus 4.7 manages 48. The same workload that takes Opus almost six hours to finish takes Mercury 2 about 21 minutes, and the bill is 33 times smaller. The benchmarks tell a more complicated story, but the wall-clock math is the part that rewrites the routing table.

Most pricing comparisons rank models on dollars per million tokens and stop there. Mercury 2 breaks that frame because it sits in two separate budget tiers at the same time. On output dollars, it is in the Gemini Flash-Lite class. On wall-clock latency, it is in a tier no autoregressive model has reached. The combination changes which workloads make economic sense to route to a reasoning model in the first place. Below is the full pricing, speed, benchmark, and workload math, with caveats about where the diffusion approach still leaves quality on the table.
The price-and-speed table that does not fit on a single axis
Every reasoning launch is now graded on two numbers: cost per million output tokens and tokens per second. Mercury 2 is the first model where both are class-leading at once. The headline figures, sourced from Inception Labs and verified against Artificial Analysis's third-party measurements:
| Model | Input / 1M | Output / 1M | Output speed | Context |
|---|---|---|---|---|
| Mercury 2 | $0.25 | $0.75 | 788 tok/s | 128K |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | ~250 tok/s | 1M |
| Grok 4 Fast | $0.20 | $0.50 | ~180 tok/s | 2M |
| DeepSeek V4-Pro (promo) | $0.435 | $0.87 | 37 tok/s | 1M |
| Gemini 3.1 Pro | $2.00 | $12.00 | 123 tok/s | 2M |
| Claude Opus 4.7 | $5.00 | $25.00 | 48 tok/s | 200K |
| GPT-5.5 | $5.00 | $30.00 | 72 tok/s | 272K |
Note the speed column. Mercury 2's 788 tok/s is the Artificial Analysis measured figure; Inception's own claim is 1,009 tok/s on Blackwell, with peak headline figures up to 1,196. Whichever number is used, the gap to the next-fastest frontier reasoning model in the table, Gemini 3.1 Pro at 123 tok/s, is at least 6x. Cached input drops Mercury 2 to $0.025/M, which makes prompt-cached pipelines effectively free on the input side.
What diffusion actually changes about the bill
Autoregressive decoding is sequential by construction: token N waits for token N-1, which waits for the KV cache lookup, which waits for memory bandwidth. The per-request throughput ceiling on a single H100 is roughly 100 to 200 tokens per second for a 70B-class model. Diffusion language models invert this: a fixed-length output is initialised as noise (or masked tokens) and refined in parallel across 8 to 20 denoising steps. The number of forward passes is roughly constant regardless of output length, and there is no KV cache to maintain.
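A toy sketch of the two decoding loops makes the forward-pass arithmetic concrete. The step count of 16 is an assumed value inside the 8-to-20 range above, and the model calls are placeholders, not Inception's API:

```python
# Toy comparison of forward-pass counts, not a real inference engine.
# The model calls are placeholders; the step count is an assumption.

def autoregressive_passes(out_len: int) -> int:
    """Sequential decoding: one forward pass per output token."""
    passes = 0
    for _ in range(out_len):   # token N cannot start until token N-1 exists
        passes += 1            # model.forward(context) would run here
    return passes

def diffusion_passes(out_len: int, steps: int = 16) -> int:
    """Parallel denoising: the whole sequence is refined a fixed number of times."""
    passes = 0
    for _ in range(steps):     # model.denoise(full_sequence) would run here
        passes += 1            # each step touches all out_len positions at once
    return passes

for n in (1_000, 10_000, 50_000):
    print(f"{n:>6} tokens: {autoregressive_passes(n):>6} AR passes vs {diffusion_passes(n)} denoising steps")
```

The per-step cost of the diffusion model still grows with output length, but the number of round trips does not, which is where the latency win comes from.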
This parallel-denoising design changes three things about the unit economics:
- Throughput per GPU is 5-10x higher. A single Blackwell node serving Mercury 2 at 1,000 tok/s replaces five to ten nodes serving Opus 4.7 at 100 tok/s. Capex per token falls accordingly, and Inception passes the savings through (the sketch after this list runs the node arithmetic).
- Long outputs do not slow down linearly. On autoregressive models, a 50K-token output takes roughly 50x longer than a 1K output. On Mercury 2, the denoising-step count grows sub-linearly with length, so a 50K answer finishes in about a minute instead of the quarter-hour a frontier autoregressive model needs. Long-form drafting becomes interactive instead of batched.
- Memory pressure does not scale with context. KV cache is the dominant memory cost on autoregressive inference past 100K tokens. Mercury 2 sidesteps that, which is why 128K context fits on a single GPU at hundreds of concurrent requests.
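For the first bullet, the capacity arithmetic is one division. The 10,000 tok/s aggregate demand below is an assumption for illustration; the per-node speeds are the figures quoted above:

```python
import math

# Nodes needed to sustain an assumed aggregate demand, given per-node throughput.
def nodes_needed(demand_tok_s: float, per_node_tok_s: float) -> int:
    return math.ceil(demand_tok_s / per_node_tok_s)

demand = 10_000  # assumed aggregate tok/s a service has to sustain
print(nodes_needed(demand, 1_000))   # 10 Blackwell nodes serving Mercury 2
print(nodes_needed(demand, 100))     # 100 nodes serving an autoregressive model at ~100 tok/s
```

Capex per token tracks the node count, so the 5-10x figure in the bullet falls straight out of the speed ratio.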
The catch: diffusion is newer for reasoning workloads, and per-token output quality still lags top autoregressive models on harder benchmarks. The gap shows up clearly on the next table.
The benchmarks Inception published, and the ones missing from the table
Inception's launch material publishes seven benchmark scores plus the Artificial Analysis intelligence index. Several headline benchmarks are absent: SWE-Bench Verified, MATH 500, and full MMLU. That is worth flagging because those are the benchmarks most production teams use to qualify a reasoning model. Scores not published for the comparison models are marked n/p below.
| Benchmark | Mercury 2 | GPT-5.5 | Opus 4.7 |
|---|---|---|---|
| AIME 2025 | 91.1 | 94.6 | 92.4 |
| GPQA Diamond | 73.6 | 93.6 | 94.2 |
| HumanEval | 89.4 | 96.1 | 95.4 |
| LiveCodeBench | 67.3 | ~78 | ~75 |
| IFBench | 71.3 | n/p | n/p |
| Tau2 | 52.9 | n/p | n/p |
| SciCode | 38.4 | n/p | n/p |
| AA Intelligence Index | 33 (rank 35) | 72 | 69 |
AIME 2025 at 91.1 is genuinely strong - within 1.3 points of Opus 4.7 and within striking distance of GPT-5.5. The model can do hard math. GPQA Diamond at 73.6 is twenty points behind the leaders, which is where the diffusion-vs-autoregressive quality gap shows up on graduate-level scientific reasoning. HumanEval at 89.4 is solid for single-function code, but LiveCodeBench at 67.3 hints that multi-file and repo-scale coding will struggle.
The Artificial Analysis intelligence index lands Mercury 2 at rank 35, which is the cleanest single-number summary of where it sits. It is not in the same accuracy tier as GPT-5.5 or Opus 4.7. It is in the same accuracy tier as last year's frontier models, at a fraction of the price and a multiple of the speed.
Wall-clock economics for a million output tokens
Take 1,000,000 output tokens and divide by output speed to get wall-clock time. Multiply by output dollars per million for total cost. Taken together, those two numbers - dollars and minutes - are what actually move user experience and infrastructure bills on long-output workloads.
| Model | Time for 1M output | Output cost | vs Mercury 2 |
|---|---|---|---|
| Mercury 2 | 21 min | $0.75 | baseline |
| DeepSeek V4-Pro (promo) | 7.5 hours | $0.87 | 21x slower, ~same cost |
| Gemini 3.1 Flash-Lite | ~67 min | $1.50 | 3x slower, 2x cost |
| Gemini 3.1 Pro | 2.3 hours | $12.00 | 6.4x slower, 16x cost |
| GPT-5.5 | 3.9 hours | $30.00 | 11x slower, 40x cost |
| Claude Opus 4.7 | 5.8 hours | $25.00 | 16x slower, 33x cost |
The DeepSeek row is the interesting one: same output cost as Mercury 2 (within 16 percent), but 21x slower wall-clock. If you are running synchronous user-facing workloads, the cost gap is not the deciding factor; the latency gap is. For background-batch workloads where latency is free, V4-Pro stays competitive on price alone. For everything user-facing, Mercury 2's speed turns into actual conversion and engagement deltas.
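The table is two divisions per row. A minimal sketch that reproduces it from the listed output prices and the measured (or approximate) speeds in the first table:

```python
# Reproduce the wall-clock table from output price ($/1M tokens) and speed (tok/s).
models = {
    "Mercury 2":             (0.75, 788),
    "DeepSeek V4-Pro promo": (0.87, 37),
    "Gemini 3.1 Flash-Lite": (1.50, 250),
    "Gemini 3.1 Pro":        (12.00, 123),
    "GPT-5.5":               (30.00, 72),
    "Claude Opus 4.7":       (25.00, 48),
}

TOKENS = 1_000_000
base_price, base_speed = models["Mercury 2"]

for name, (price, speed) in models.items():
    hours = TOKENS / speed / 3600          # wall-clock time for 1M output tokens
    print(f"{name:24s} {hours:5.2f} h  ${price:6.2f}  "
          f"{base_speed / speed:4.1f}x slower  {price / base_price:4.1f}x cost")
```

The Mercury 2 row prints 1.0x on both ratios by construction; the rest reproduces the table to rounding.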
Four workload shapes, four bills
The same shapes used in our GPT-5.5 vs Opus 4.7 vs Gemini 3.1 Pro comparison, adjusted for Mercury 2's 128K context cap. The heavier refactor scenario caps input at 100K so the comparison stays apples-to-apples.
| Workload | Mercury 2 | Flash-Lite | V4-Pro promo | Opus 4.7 | GPT-5.5 |
|---|---|---|---|---|---|
| Chat reply (20K in / 5K out) | $0.009 | $0.013 | $0.013 | $0.225 | $0.250 |
| Long answer (5K in / 30K out) | $0.024 | $0.046 | $0.028 | $0.775 | $0.925 |
| Mid refactor (100K in / 30K out, capped) | $0.048 | $0.070 | $0.069 | $1.250 | $1.400 |
| 1B tokens / month (70/30 blend) | $400 | $625 | $566 | $11,000 | $12,500 |
At 1 billion blended tokens monthly, Mercury 2 saves roughly $10,600 every 30 days against Opus 4.7 and $12,100 against GPT-5.5. Annualised, that is between $127K and $145K per single-tenant workload. The catch is that Mercury 2 has to clear an accuracy bar that varies by task: for chat replies, summarisation, and structured extraction the bar is low and the savings stick. For agentic coding and graduate-science reasoning, the GPQA and LiveCodeBench gaps mean a meaningful chunk of those savings gets clawed back by failures.
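Each bill is a weighted sum of the two list prices. A minimal sketch of the calculator behind the table, using the per-million prices from the first table with no caching discounts or verbosity adjustment:

```python
# Per-request and per-month bills from list prices alone.
# Prices are $ per 1M tokens as (input, output).
PRICES = {
    "Mercury 2":  (0.25, 0.75),
    "Flash-Lite": (0.25, 1.50),
    "V4-Pro":     (0.435, 0.87),
    "Opus 4.7":   (5.00, 25.00),
    "GPT-5.5":    (5.00, 30.00),
}

def bill(model: str, tokens_in: int, tokens_out: int) -> float:
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

# Mid refactor: 100K in / 30K out
for name in PRICES:
    print(f"{name:12s} ${bill(name, 100_000, 30_000):.3f} per request")

# 1B blended tokens per month at a 70/30 input/output split
for name in PRICES:
    print(f"{name:12s} ${bill(name, 700_000_000, 300_000_000):,.0f} per month")
```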
Where the headline win gets trimmed
The pricing and speed numbers are accurate. The win is real. A handful of structural issues show up only when you read the Artificial Analysis report footnotes or run Mercury 2 against your own evals, and they matter.
Verbosity is the most consequential. Artificial Analysis measured 69 million tokens generated across their evaluation suite versus a 26 million median for other models in the same class - roughly 160 percent more output tokens per task. That erodes a chunk of the per-million-token savings once you bill the actual workload rather than the listed unit price. In the worst case, effective per-task output cost lands closer to $2 per million than the headline $0.75. Cached inputs do not help, because the verbosity is on the output side.
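The effective-price arithmetic, using the Artificial Analysis token counts quoted above:

```python
# Verbosity-adjusted effective output price: the listed $/M scaled by how many
# more tokens the model emits for the same tasks (Artificial Analysis figures).
listed_price = 0.75              # $ per 1M output tokens
mercury_tokens = 69_000_000      # tokens generated across the eval suite
class_median = 26_000_000        # median for models in the same class

verbosity_multiplier = mercury_tokens / class_median   # ~2.65x
effective_price = listed_price * verbosity_multiplier  # ~$1.99 per 1M task-equivalent tokens
print(f"{verbosity_multiplier:.2f}x verbosity -> ${effective_price:.2f}/M effective")
```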
Context is the next gap. At 128K, Mercury 2 sits a tier below what long-document RAG needs. Gemini 3.1 Pro offers 2M context (with a tiered price step at 200K). DeepSeek V4-Pro offers 1M context flat. Mercury 2 at 128K also trails GPT-5.5 (272K) and Opus 4.7 (200K), and it cannot fit a full codebase the way Gemini can. The 50K max output is generous and helps recover ground for long-form drafting, though it does not undo the input ceiling.
Then there is the SWE-Bench omission. Inception's benchmark suite skips it entirely. The closest proxy in their material is LiveCodeBench at 67.3, which suggests SWE-Bench would land somewhere around 55 to 65 percent if measured. That puts Mercury 2 below Hunyuan HY3 (74.4), well below DeepSeek V4-Pro (83.7), and far below GPT-5.5 (88.7) on the benchmark agentic coding tools care most about. For repo-scale coding, route to a specialised model. Mercury 2's niche is not autonomous code-writing; it is high-throughput reasoning where speed and unit cost both matter.
So what should actually move
High-throughput batch jobs, real-time chat replies, draft generation, summarisation, classification, structured-output extraction, and AIME-style math problems are the workloads where Mercury 2 becomes the new default. The combination of 788 tok/s and $0.75 per million output is unmatched for anything that needs to feel snappy and bill cheap. Migrating a customer support assistant from Opus 4.7 to Mercury 2 typically cuts the bill by about 95 percent and the user-perceived response time by roughly 90 percent at the same time. The migration cost is one round of evals and a prompt tweak to handle Mercury's verbosity.
Leave alone: anything that depends on SWE-Bench-grade autonomous coding (route to GPT-5.5 or Opus 4.7), anything that regularly needs more than 128K context (route to Gemini 3.1 Pro), and anything that demands the absolute top GPQA Diamond accuracy for graduate-level research or certain regulatory analysis. For those workloads, Mercury 2 still earns a spot as a cheap front-end, taking on first-pass extraction or query decomposition before the slower, more accurate model handles the final step. The hybrid stack typically lands at under a tenth of the cost of running Opus 4.7 end-to-end with no meaningful quality regression.
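Put together, the routing rule is short. A hypothetical sketch: the request fields and the 128K threshold are illustrative assumptions, and the model identifiers are just the names used in this article, not real endpoint strings:

```python
# Hypothetical router encoding the split described above; thresholds and
# request fields are assumptions for illustration, not vendor guidance.
from dataclasses import dataclass

@dataclass
class Request:
    context_tokens: int
    needs_repo_scale_coding: bool   # SWE-Bench-grade autonomous coding
    needs_top_gpqa_accuracy: bool   # graduate-level science / regulatory analysis

def route(req: Request) -> str:
    if req.context_tokens > 128_000:
        return "gemini-3.1-pro"              # beyond Mercury 2's context ceiling
    if req.needs_repo_scale_coding:
        return "gpt-5.5"                     # or claude-opus-4.7
    if req.needs_top_gpqa_accuracy:
        # Mercury 2 as the cheap front-end (extraction, query decomposition),
        # a slower frontier model for the final answer.
        return "mercury-2 -> claude-opus-4.7"
    return "mercury-2"                       # high-throughput, low-cost default

print(route(Request(20_000, False, False)))  # mercury-2
```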
Sources
- Inception Labs: Introducing Mercury 2 - Official pricing ($0.25/$0.75 per 1M), 1,009 tok/s on Blackwell, benchmark scores
- Artificial Analysis: Mercury 2 - Third-party measured 788 tok/s, intelligence index 33, verbosity footnote
- OpenRouter: inception/mercury-2 - Routing endpoint listing, 128K context confirmation
- BusinessWire: Mercury 2 launch press release - Full benchmark table, 5x speed claim vs Claude 4.5 Haiku and GPT-5 Mini
- The Neuron Daily: diffusion models coming for text - Peak 1,196 tok/s headline, industry framing
- Anthropic: Claude pricing - Opus 4.7 at $5/$25 per 1M for the speed/cost comparison rows
- OpenAI: API pricing - GPT-5.5 at $5/$30 per 1M, used for the 40x output-cost comparison