TokenCost · Comparison · April 5, 2026 · 7 min read

Reasoning models in 2026: $0.55 to $20 per million tokens, and when each tier makes sense

The gap between the cheapest and most expensive reasoning model API is 36x. DeepSeek R1 costs $0.55 per million input tokens. o3-pro costs $20. On math and science benchmarks, R1 nearly ties o3 at a quarter of the price. On coding tasks, the more expensive models win by 23+ percentage points. Here's what the data actually shows.

[Image: dark server room with glowing rack lights. Photo by imgix on Unsplash.]

  • DeepSeek R1 nearly ties o3 on grad-level science (81.0% vs 83%) at a quarter of the input price
  • o4-mini beats R1 by 23+ points on SWE-bench - the gap is real for coding pipelines
  • o3-pro waits ~101 seconds before its first token; useful only as a background batch job
  • Anthropic and Google bundle thinking tokens in output pricing; OpenAI charges full output rate for them

What each model actually costs

Six reasoning models worth knowing about, with current API prices. All figures are per million tokens at standard (non-batch) rates.

| Model | Provider | Input / 1M | Output / 1M | Cached input | Context |
|---|---|---|---|---|---|
| DeepSeek R1 | DeepSeek | $0.55 | $2.19 | $0.14 | 128K |
| o4-mini | OpenAI | $1.10 | $4.40 | n/a | 200K |
| o3 | OpenAI | $2.00 | $8.00 | n/a | 200K |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 | $0.125 | 1M |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | $0.30 | 1M |
| o3-pro | OpenAI | $20.00 | $80.00 | n/a | 200K |

Prices from official API docs as of April 2026. Gemini 2.5 Pro input price above 200K tokens doubles to $2.50/M. Claude Sonnet 4.6 extended thinking is included in standard output pricing. o3-pro max output is 100K tokens.

The numbers that jump out: R1 at $0.55 input vs o3-pro at $20 input - on input alone, o3-pro charges roughly $36 for every dollar R1 does. But the comparison most teams actually care about is R1 vs o3 (73% cheaper on input) or R1 vs o4-mini (half the price), since o3-pro occupies a very specific niche covered below.
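For back-of-envelope comparisons, the table reduces to a few lines of code. This is a sketch using the prices listed above; the 4K-in / 2K-out example mix is illustrative, not a measured workload:

```python
# Per-million-token API prices from the table above (USD).
PRICES = {
    "deepseek-r1": {"input": 0.55,  "output": 2.19},
    "o4-mini":     {"input": 1.10,  "output": 4.40},
    "o3":          {"input": 2.00,  "output": 8.00},
    "o3-pro":      {"input": 20.00, "output": 80.00},
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one call. Reasoning tokens count as output tokens."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: 4K input tokens, 2K output tokens (reasoning included)
for model in PRICES:
    print(f"{model}: ${query_cost(model, 4_000, 2_000):.4f}")
```

At this mix, R1 comes out to well under a cent per call while o3-pro is around a quarter of a dollar, which is the 36x headline gap showing up at query granularity.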

One thing this table doesn't show: OpenAI charges full output token rates for reasoning tokens (the internal chain-of-thought steps). Anthropic and Google bundle thinking tokens into their standard output pricing. For heavy reasoning tasks that produce long chains before answering, this matters more than the input price.

Where R1 wins and where it doesn't

The benchmark picture for R1-0528 vs o4-mini is genuinely split in a way that matters for how you use the model. On math and science, R1 is competitive. On software engineering tasks, there's a gap that's too large to ignore.

| Benchmark | DeepSeek R1 | o4-mini | Notes |
|---|---|---|---|
| GPQA Diamond (graduate science reasoning) | 81.0% | 81.4% | Effectively tied |
| AIME 2025 (math olympiad) | 87.5% | 92.7% | o4-mini +5.2 pts |
| SWE-bench Verified (real-world coding) | 44.6-57.6% | 68.1-80.8% | o4-mini wins by 23+ pts |
| Aider Polyglot (multi-language code editing) | 71.6% | 68.9% | R1 edges ahead |
| LiveCodeBench (competitive programming) | 73.3% | n/a | o4-mini not reported |

R1 scores from the DeepSeek R1-0528 release notes. o4-mini scores from OpenAI's system card. SWE-bench ranges reflect different scaffold configurations. Treat comparisons as directional.

The GPQA Diamond tie is the most useful number here. Graduate-level science reasoning at $0.55/M vs $1.10/M - if that's your workload (medical Q&A, research synthesis, technical analysis), R1 is the better economic choice. You're paying about twice as much for o4-mini and getting, on this specific task, essentially the same accuracy.

SWE-bench is where it gets uncomfortable for R1. A 23+ point gap on production coding tasks isn't a benchmark quirk - it reflects something real about how well each model follows software engineering constraints, handles edge cases, and avoids introducing bugs. Independent analysis from Artificial Analysis puts o4-mini notably ahead of R1 on coding tasks across multiple configurations. If your pipeline does code generation at scale, test both before committing.

o3-pro is a batch job, not a chatbot

The $20 input price is only part of why o3-pro is a specialized tool. The other part: its median time-to-first-token is around 101 seconds. Nearly two minutes before you see the first output token, followed by 29 tokens per second. For context, o3 runs at ~12 tokens per second with a fraction of that wait.

OpenAI explicitly recommends async mode for o3-pro to avoid timeouts. This isn't a bug - it's the mechanism. o3-pro uses more compute time to reason before responding, which is why it's described as giving "the most reliable responses." The premium buys extended thinking time, not faster answers.

The practical implication: o3-pro belongs in pipelines that can tolerate latency. Overnight batch analysis, complex document review where you're not waiting at a screen, research tasks where getting the answer right matters more than getting it quickly. If you need interactive reasoning, o3 at $2/M is a much better fit.
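Whether a pipeline can tolerate o3-pro comes down to simple wall-clock arithmetic. A rough estimate, using the TTFT and streaming-rate figures cited in this article (actual latency varies by load and prompt):

```python
def response_time(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Rough wall-clock seconds for one completion: wait for the first
    token, then stream the rest at a steady rate."""
    return ttft_s + output_tokens / tokens_per_s

# o3-pro figures from this article: ~101 s to first token, ~29 tokens/s
print(response_time(101, 29, 5_000))  # a 5K-token answer takes ~4.5 minutes
```

Anything with a user waiting at a screen, or an HTTP gateway with a 60-second timeout, is ruled out before cost even enters the picture.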

How thinking tokens get billed

Every reasoning model generates internal chain-of-thought tokens before producing the final answer. How these tokens get priced differs by provider, and it meaningfully affects total cost on reasoning-heavy tasks.

| Provider | How thinking tokens are billed |
|---|---|
| OpenAI (o-series) | Charged at full output token rate. Reasoning tokens appear in the output field of usage. |
| Anthropic (Claude Sonnet 4.6) | Extended thinking tokens billed at standard output rate ($15/M). No separate thinking surcharge. |
| Google (Gemini 2.5 Pro) | Thinking tokens included in output pricing ($10/M under 200K, $15/M above). Thinking budget is configurable. |
| DeepSeek (R1) | Reasoning tokens billed at standard output rate ($2.19/M). Cache hit on inputs: $0.14/M. |

In practice, a complex math problem on o4-mini might produce 2,000-8,000 reasoning tokens before the final answer. At $4.40/M output, that's $0.0088-$0.035 in reasoning tokens per query before counting the actual response. On the same task, Gemini 2.5 Pro charges those thinking tokens at $10/M - higher rate, but you get a 1M context window and the ability to cap thinking budget.
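That overhead is easy to make concrete. A minimal sketch, using the output rates from the table; the 2K-8K thinking-token range is the article's illustrative figure, not a guarantee:

```python
def reasoning_overhead(thinking_tokens: int, output_price_per_m: float) -> float:
    """USD spent on hidden chain-of-thought tokens, billed at the
    provider's output rate (how all four providers handle them)."""
    return thinking_tokens * output_price_per_m / 1_000_000

# o4-mini at $4.40/M output, 2K-8K thinking tokens on a hard math query
low = reasoning_overhead(2_000, 4.40)
high = reasoning_overhead(8_000, 4.40)
print(f"${low:.4f} - ${high:.4f} per query before the visible answer")
```

Per query the numbers look negligible, but at a million reasoning-heavy queries a month the thinking tokens alone can run to tens of thousands of dollars, which is why the billing convention matters more than the input rate here.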

DeepSeek R1's cached input pricing ($0.14/M) is worth flagging for repeated long-context calls - like running analysis over the same codebase or document set across many queries. At 74% off standard input pricing, the cache hit rate can substantially change the math on high-volume workloads.
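The effect of the cache hit rate on blended input cost is worth working through. A sketch with R1's published rates; the 100M-token monthly volume and hit rates are hypothetical:

```python
def input_cost(total_tokens: int, cache_hit_rate: float,
               standard: float = 0.55, cached: float = 0.14) -> float:
    """Blended R1 input cost in USD for a given cache hit rate.
    Prices are per million tokens, from DeepSeek's published rates."""
    hit = total_tokens * cache_hit_rate
    miss = total_tokens - hit
    return (hit * cached + miss * standard) / 1_000_000

# 100M input tokens/month over a mostly-repeated codebase prompt
for rate in (0.0, 0.5, 0.9):
    print(f"{rate:.0%} cache hits: ${input_cost(100_000_000, rate):,.2f}")
```

Going from no caching to a 90% hit rate cuts the monthly input bill from $55 to about $18, which is the kind of difference that decides whether a high-volume workload pencils out.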

Which model for which job

DeepSeek R1 at $0.55/M

Math, science reasoning, research analysis, and any task where GPQA Diamond-class performance is the bar. At 81% on GPQA and 87.5% on AIME 2025, R1 is competitive with o3 and o4-mini on these tasks at a fraction of the price. The 128K context window is the main constraint - if you're working with long documents, that ceiling matters.

R1 also makes sense for high-volume inference where you need reasoning capability but can't absorb OpenAI pricing at scale. The cached input price of $0.14/M is the best cached pricing in this tier.

o4-mini at $1.10/M

Code generation pipelines. The SWE-bench gap over R1 (68-80% vs 44-57%) is real enough to pay for. o4-mini also supports tool use during reasoning - it can call tools, check the output, and incorporate results into its chain of thought before responding. R1 doesn't do this the same way.

o3 at $2/M

The step up from o4-mini when you need stronger reasoning without paying o3-pro rates. On GPQA Diamond, o3 scores 83-87.7% vs o4-mini's 81.4% - not dramatic, but measurable. Worth testing when o4-mini answers feel incomplete on complex multi-step tasks.

o3-pro at $20/M

Narrow but real use case: high-stakes decisions where latency doesn't matter. Overnight document analysis, legal or financial review where being wrong is expensive, research tasks run as background jobs. Given that o3 at $2/M is already strong, o3-pro is hard to justify for most workloads. The 101-second wait settles this for anything interactive.
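The per-model guidance above collapses into a trivial dispatch table. The task categories and the default are this article's framing, not an API contract, and any real router would key off richer signals than a label:

```python
# Hypothetical workload -> model routing, following this article's guidance.
ROUTING = {
    "math":             "deepseek-r1",  # GPQA/AIME-class reasoning at $0.55/M
    "science":          "deepseek-r1",
    "coding":           "o4-mini",      # 23+ pt SWE-bench lead over R1
    "hard-multi-step":  "o3",           # when o4-mini answers feel incomplete
    "high-stakes-batch": "o3-pro",      # latency-tolerant background jobs only
}

def pick_model(task: str) -> str:
    """Route a task label to a model, defaulting to the mid-tier option."""
    return ROUTING.get(task, "o4-mini")
```

Even this crude split captures the economics: the expensive tier only ever gets selected when latency is explicitly off the table.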

Where things stand

The reasoning model market has split into two tiers that are genuinely different products. R1, o4-mini, and o3 are interactive reasoning models - they fit in real-time pipelines, and their pricing, while higher than that of standard models, is workable at scale. o3-pro is something else: expensive, slow, and designed for batch work where extended thinking time changes the quality of the answer.

For most teams, o4-mini for coding tasks and R1 for math or science analysis covers the bulk of reasoning needs at a combined cost well below the premium tier. Test both on your actual data before committing - benchmark scores are directional, not a substitute for running your workload.
