Comparison · May 10, 2026 · 10 min read

GPT-5.5 vs Opus 4.7 vs Gemini 3.1 Pro: the cheapest one depends entirely on whether you cross 200K tokens

On a 50K-token coding request, Gemini 3.1 Pro is the cheapest at $0.22, Opus 4.7 next at $0.50, and GPT-5.5 last at $0.55. The numbers feel orderly. They do not stay orderly. As soon as your workload pushes past 200K input tokens, Gemini steps to a higher tier. Past 272K, GPT-5.5 doubles input pricing and lifts output by half. Opus stays nominally flat but bills 25 to 37 percent more tokens than Opus 4.6 did for identical inputs. Every one of these three flagships has a hidden cost cliff. Knowing where the cliffs sit is the difference between a $3.56 bill and an $8.90 bill on the same 800K-token call.

Three flagships, three different ways the price moves on you. The cliffs nobody puts on the comparison table:

| Model | Where the bill changes shape |
| --- | --- |
| GPT-5.5 | $5/$30 below 272K tokens. $10/$45 above. Whole session, not the tail. |
| Opus 4.7 | $5/$25 flat. New tokenizer bills 25-37% more tokens on the same input. |
| Gemini 3.1 Pro | $2/$12 below 200K. $4/$18 above. Whole session too. |

The list price, before the asterisks

All three landed within nine weeks of each other: Gemini 3.1 Pro entered public preview on February 19, Opus 4.7 hit GA on April 16, and GPT-5.5 followed on April 23. The sticker rates per million tokens look like this:

| Model | Input ($/M) | Output ($/M) | Cached input ($/M) | Tier flip |
| --- | --- | --- | --- | --- |
| GPT-5.5 | $5.00 | $30.00 | $0.50 | 2x in / 1.5x out above 272K |
| Claude Opus 4.7 | $5.00 | $25.00 | $0.50 | None on price; 1.25-1.37x tokenizer |
| Gemini 3.1 Pro | $2.00 | $12.00 | $0.50 | $4 in / $18 out above 200K |

Read just the Input and Output columns and Gemini wins by 2.5x on input and 2.1-2.5x on output. That is the headline most coverage stops at. The tier-flip column is what erases or amplifies that order, depending on what you send through each model.

The 272K cliff on GPT-5.5

Of the three asterisks, GPT-5.5's is the steepest. Once a single API call exceeds 272K input tokens, the entire session bills at $10 input / $45 output per million. Not the tokens above the threshold. The whole session. A 271K-token call bills at the standard rate. A 273K-token call bills 2x on every input token and 1.5x on every output token, including the first 272K.

For an 800K input / 20K output call, the math: 0.80M × $10 + 0.02M × $45 = $8.00 + $0.90 = $8.90. Without the surcharge it would be 0.80M × $5 + 0.02M × $30 = $4.00 + $0.60 = $4.60. The surcharge nearly doubles the bill. For long-context retrieval, RAG over a large codebase, or any agent that pulls big documents into context, that single threshold is the most expensive line on the OpenAI pricing page that most people have not read.
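
The whole-session rule is easy to encode and worth checking against your own traffic shapes. A minimal sketch in Python, using the thresholds and rates above; the function name and structure are ours, not anything in OpenAI's API:

```python
def gpt55_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate a GPT-5.5 bill under the whole-session tier rule:
    crossing 272K input tokens reprices every token, not just the tail."""
    if input_tokens > 272_000:
        in_rate, out_rate = 10.00, 45.00   # $/M, surcharge tier
    else:
        in_rate, out_rate = 5.00, 30.00    # $/M, standard tier
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(gpt55_cost(271_000, 20_000))  # 1.955 -- just under the cliff
print(gpt55_cost(800_000, 20_000))  # 8.9   -- whole session repriced
```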

Codex CLI caps context at 400K, which sits inside the surcharge zone. If you run agentic coding workloads through Codex on GPT-5.5, the surcharge fires by default any time the agent loads more than 272K of repository, logs, or test output. Plan for it or route long-context work elsewhere.

The 200K step on Gemini 3.1 Pro

Gemini's tier flip sits at a lower threshold (200K) but at the same multiples as GPT-5.5 (2x input, 1.5x output). The pricing page lists the rates as $0.00000200 input / $0.00001200 output per token below 200K, jumping to $0.00000400 and $0.00001800 above. Like OpenAI's rule, the higher tier applies to the entire session once you cross the line.

On the 800K / 20K shape from above, Gemini 3.1 Pro sits at $3.56. That is still less than half what GPT-5.5 charges for the same call, even with the tier flip active. The 200K threshold matters more for medium-sized requests. A 200K-in / 50K-out retrieval call bills $1.00 on tier 1. Let one extra chunk push input to 201K and the same call bills $1.70, a 70% premium applied to the entire bill.
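
The same rule as a sketch, with Google's thresholds plugged in; again, the function is illustrative, not an SDK call:

```python
def gemini31_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate a Gemini 3.1 Pro bill; the higher tier reprices the
    whole session once input crosses 200K, mirroring GPT-5.5's rule."""
    if input_tokens > 200_000:
        in_rate, out_rate = 4.00, 18.00   # $/M, tier 2
    else:
        in_rate, out_rate = 2.00, 12.00   # $/M, tier 1
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(gemini31_cost(200_000, 50_000))  # 1.0   -- at the line
print(gemini31_cost(201_000, 50_000))  # 1.704 -- one chunk over, +70%
```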

If you run a retrieval pipeline that hovers around 200K input, treat the threshold as a hard budget. Truncate the lowest-relevance chunks rather than passing them through: trimming a few thousand tokens of marginal context avoids a 70% premium on the whole call.
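
One way to enforce that budget, sketched under assumptions: your retriever returns scored chunks, and you leave headroom below the line for the system prompt and question. The tuple shape and function name are hypothetical.

```python
def fit_to_budget(chunks, budget_tokens=190_000):
    """Keep the highest-relevance chunks that fit under the tier threshold.
    `chunks` is a list of (relevance_score, token_count, text) tuples;
    the default budget leaves ~10K of headroom below Gemini's 200K line."""
    kept, total = [], 0
    for score, n_tokens, text in sorted(chunks, key=lambda c: -c[0]):
        if total + n_tokens > budget_tokens:
            continue  # dropping a marginal chunk is cheaper than flipping the tier
        kept.append(text)
        total += n_tokens
    return kept, total
```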

Opus 4.7's tokenizer tax, in one sentence

Opus 4.7 is the only one of the three with a flat per-token rate at every context length. The rate is $5/$25, identical to Opus 4.6. What changed is that the same string of code, JSON, or natural language now produces 1.25 to 1.37x as many tokens as it did on Opus 4.6. Three weeks of production billing data from Finout, OpenRouter, CloudZero, and a few publicly shared invoices put the real-world inflation at 25 to 37 percent on chat, RAG, and coding workloads.

That math is documented in detail in our Opus 4.7 tokenizer post. For the comparison here, the takeaway is: when this article quotes Opus 4.7 at $0.50 on a 50K input call, the production version of that call is closer to $0.625. The post-tokenizer column matters more than the sticker for any planning forecast.
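
For forecasting, the adjustment is a single multiplier on token counts. A sketch, with 1.25 (the low end of the observed production range) as an explicit assumption:

```python
def opus47_cost(input_tokens: int, output_tokens: int,
                inflation: float = 1.25) -> float:
    """Estimate an Opus 4.7 bill from token counts measured on the old
    tokenizer. Rates are flat; `inflation=1.25` is an assumption at the
    low end of the 25-37% production inflation cited above."""
    return (input_tokens * inflation * 5.00 +
            output_tokens * inflation * 25.00) / 1_000_000

print(opus47_cost(50_000, 10_000))  # 0.625 -- the sticker-$0.50 call in production
```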

Five workloads, one winner, shifting margins

Bills, computed on real-world request shapes. Sticker math first; Opus 4.7 also shown with a +25% tokenizer adjustment to reflect what the bill actually looks like.

| Workload | GPT-5.5 | Opus 4.7 (sticker / +25%) | Gemini 3.1 Pro | Cheapest |
| --- | --- | --- | --- | --- |
| Casual coding (50K in / 10K out) | $0.55 | $0.50 / $0.625 | $0.22 | Gemini |
| Mid refactor at the line (200K in / 50K out) | $2.50 | $2.25 / $2.81 | $1.00 | Gemini |
| Refactor over the line (250K in / 50K out) | $2.75 | $2.50 / $3.13 | $1.90 | Gemini |
| Repo-scale agent (500K in / 100K out) | $9.50 | $5.00 / $6.25 | $3.80 | Gemini |
| Long-context audit (800K in / 20K out) | $8.90 | $4.50 / $5.63 | $3.56 | Gemini |
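
The whole table falls out of the three tier rules. A reproduction sketch, carrying the same +25% Opus inflation assumption as above:

```python
# ($/M input, $/M output) per tier, plus the input-token cliff, per the article
RATES = {
    "GPT-5.5":        {"below": (5, 30), "above": (10, 45), "cliff": 272_000},
    "Opus 4.7":       {"below": (5, 25), "above": (5, 25),  "cliff": None},
    "Gemini 3.1 Pro": {"below": (2, 12), "above": (4, 18),  "cliff": 200_000},
}

def cost(model, in_tok, out_tok, token_inflation=1.0):
    in_tok, out_tok = in_tok * token_inflation, out_tok * token_inflation
    r = RATES[model]
    tier = "above" if r["cliff"] and in_tok > r["cliff"] else "below"
    in_rate, out_rate = r[tier]
    return (in_tok * in_rate + out_tok * out_rate) / 1e6

WORKLOADS = [
    ("Casual coding",       50_000,  10_000),
    ("Mid refactor",       200_000,  50_000),
    ("Over the line",      250_000,  50_000),
    ("Repo-scale agent",   500_000, 100_000),
    ("Long-context audit", 800_000,  20_000),
]

for name, i, o in WORKLOADS:
    print(f"{name:18s}  GPT-5.5 ${cost('GPT-5.5', i, o):.2f}  "
          f"Opus ${cost('Opus 4.7', i, o, 1.25):.2f}  "
          f"Gemini ${cost('Gemini 3.1 Pro', i, o):.2f}")
```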

Every row says Gemini, which is the part of this comparison that nobody will be surprised by. The interesting numbers are the ratios. On row one, Gemini comes in at roughly 40% of GPT-5.5's bill. On row five, with the surcharge active, Gemini sits at about 40% again. The ratio holds even though both tiered models cross a cliff in the middle of the table; the one exception is row three, where Gemini's flip is active and GPT-5.5's is not yet, and the gap narrows to roughly 70%. Independent benchmarking by Artificial Analysis lands on a similar cost-per-task ratio at the median request size.

Opus 4.7's sticker stays competitive: it beats GPT-5.5 on every row and beats Gemini on none. Apply the tokenizer adjustment and Opus loses to GPT-5.5 on the small-input rows too. The point is not that one model dominates; it is that the shape of the request flips the local ordering of the trailing two models.

Where each model earns its premium

Cheapest does not mean best. The benchmark sheets explain what you are paying extra for when you do not pick Gemini.

| Benchmark | GPT-5.5 | Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- |
| SWE-bench Verified (coding) | ~74% | 87.6% | 80.6% |
| MMLU-Pro (knowledge) | 83.2% | ~82% | 75.8% |
| GPQA Diamond (graduate science) | 93.6% | 94.2% | 94.3% |
| MRCR v2 (long-context retrieval) | 74.0% @ 1M | 32.2% @ 1M | 84.9% @ 128K |

Three different leaders. Opus owns coding by a wide margin, both versus the other two and versus its own predecessor; the SWE-bench Verified jump from 79.4% on Opus 4.6 to 87.6% on 4.7 is the largest generational gain Anthropic has shipped on coding since Sonnet 3.5. GPT-5.5 owns knowledge breadth and the only credible 1M-token retrieval score in the bracket. Gemini owns short-window retrieval, science, and the price sheet.

Note the long-context regression on Opus 4.7. Anthropic published 78.3% on MRCR v2 for 4.6 and 32.2% for 4.7. Same benchmark, same harness, large drop. Several community runs have replicated it. Pair that with the 1.25-1.37x tokenizer inflation and Opus 4.7 is structurally a worse choice for long-context work than Opus 4.6 was, on top of being more expensive.

The 1M calls per month math

Same casual-coding shape, scaled to a million calls per month. Order-of-magnitude estimate, useful for budget conversations, not for a precise forecast.

| Model | Per call | 1M calls / month | Annualized |
| --- | --- | --- | --- |
| GPT-5.5 | $0.55 | $550,000 | $6.6M |
| Opus 4.7 (sticker) | $0.50 | $500,000 | $6.0M |
| Opus 4.7 (with +25% tokenizer) | $0.625 | $625,000 | $7.5M |
| Gemini 3.1 Pro | $0.22 | $220,000 | $2.6M |
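
The gaps in the last column, computed directly from the per-call figures in the table:

```python
per_call = {"GPT-5.5": 0.55, "Opus 4.7 adj": 0.625, "Gemini": 0.22}
annual = {m: c * 1_000_000 * 12 for m, c in per_call.items()}
print(annual["GPT-5.5"] - annual["Gemini"])       # ~3.96M: the GPT-5.5 vs Gemini gap
print(annual["Opus 4.7 adj"] - annual["Gemini"])  # ~4.86M: adjusted Opus vs Gemini
```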

$4.0 million annualized between GPT-5.5 and Gemini at the same volume. $4.9 million between Gemini and the tokenizer-corrected Opus. At that scale, the benchmark gap on a specific task type either pays for itself or it does not, and the answer matters in the millions. Run the eval before you sign the procurement contract.

Routing the request, not the procurement contract

For coding-heavy work where SWE-bench-class quality dominates the decision, Opus 4.7 is worth the tokenizer tax; the long-context regression only rules it out once your prompts run past 200K. For knowledge breadth or 1M-token retrieval, GPT-5.5 earns its sticker, but plan the 272K cliff into your context budget and cap sessions below it where you can.

For everything else, including the boring 80 percent of production traffic, Gemini 3.1 Pro costs less than half as much at every workload size and stays competitive on the benchmarks that aren't named SWE-bench. The 200K tier flip is a real cost event, but the post-flip price is still under what either competitor charges below the line.

If you cannot decide, route by request shape: small requests to Gemini, coding requests to Opus, and 1M-context retrieval to GPT-5.5. The hybrid stack costs less than committing to any single one of them, which is what most teams running real production load are quietly doing already.
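
If that hybrid stack is where you land, the routing logic is a few lines. A sketch only: the model names are shorthand, and the thresholds (200K for Opus's regression, 400K for the GPT-5.5 handoff) are assumptions to tune on your own eval:

```python
def route(task: str, input_tokens: int) -> str:
    """Hypothetical request-shape router following the rule of thumb above."""
    if task == "coding" and input_tokens <= 200_000:
        return "opus-4.7"        # SWE-bench leader; regression bites past 200K
    if input_tokens > 400_000:
        return "gpt-5.5"         # only credible 1M-token retrieval score
    return "gemini-3.1-pro"      # cheapest at every size; the default
```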

Sources