Comparison · March 6, 2026 · 10 min read

GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro: Which One Should You Actually Use?

Three frontier models in five weeks. Opus 4.6 on February 4, Gemini 3.1 Pro on February 19, GPT-5.4 on March 5. Everyone wants to know which one is "best." The honest answer is annoying: it depends on what you're doing. So I went through the benchmark data, ran the pricing math, and tried to figure out where each one actually earns its keep.

The short version

  • Best for coding: Claude Opus 4.6 (80.8% SWE-Bench Verified). GPT-5.4 is competitive, scoring 57.7% on the harder SWE-Bench Pro set.
  • Best for reasoning: Gemini 3.1 Pro (94.3% GPQA Diamond, 77.1% ARC-AGI-2). It's also the cheapest.
  • Best for professional work: GPT-5.4 (83% GDPval across 44 professions, plus native computer use).
  • Best on a budget: Gemini 3.1 Pro. $2/$12 per 1M tokens with a 2M context window.

Pricing: 15x difference between cheapest and most expensive

The price gap across these three is massive. Gemini 3.1 Pro costs $2 per million input tokens. GPT-5.4 Pro costs $30. That's 15x for models that sometimes score within a few percentage points of each other. You can see all current rates side by side on our pricing comparison table.

Model            | Input / 1M | Output / 1M | Context
Gemini 3.1 Pro   | $2.00      | $12.00      | 2M
GPT-5.4          | $2.50      | $15.00      | 1.05M
Sonnet 4.6       | $3.00      | $15.00      | 200K
Claude Opus 4.6  | $5.00      | $25.00      | 200K
GPT-5.4 Pro      | $30.00     | $180.00     | 1.05M

A few things jump out here. Gemini 3.1 Pro matches GPT-5.4 Pro on GPQA Diamond (94.3% vs 94.4%) at one-fifteenth the cost. That gap is hard to justify unless you specifically need GPT-5.4 Pro's extended thinking for some niche task.

Claude Opus 4.6 is the most expensive standard-tier model at $5/$25, but it also has the strongest coding scores. Whether that premium is worth it depends entirely on your workload. For a code-heavy pipeline, probably yes. For general Q&A, probably not.
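
If you want to sanity-check that trade-off against your own traffic, the per-request math is easy to script. Here's a minimal Python sketch using the rates from the table above; the PRICES dict and request_cost helper are illustrative names of my own, not any vendor's API.

```python
# Illustrative only: rates from the table above, USD per 1M tokens (standard tier).
PRICES = {
    "gemini-3.1-pro":    {"in": 2.00,  "out": 12.00},
    "gpt-5.4":           {"in": 2.50,  "out": 15.00},
    "claude-sonnet-4.6": {"in": 3.00,  "out": 15.00},
    "claude-opus-4.6":   {"in": 5.00,  "out": 25.00},
    "gpt-5.4-pro":       {"in": 30.00, "out": 180.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD for a single request at the base (non-long-context) rates."""
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

# A chat-sized request: 10K tokens in, 2K out.
for m in ("gemini-3.1-pro", "gpt-5.4", "claude-opus-4.6"):
    print(m, round(request_cost(m, 10_000, 2_000), 3))
# gemini-3.1-pro 0.044
# gpt-5.4 0.055
# claude-opus-4.6 0.1
```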

Benchmarks: nobody wins everything

Here's what surprised me. There's no runaway winner. GPT-5.4 takes 4 categories, Gemini takes 4, Claude takes 2. Check our LLM leaderboard for the full ranked list. But the wins aren't equal in importance, and some of these benchmarks matter more than others depending on what you build.

Benchmark                                | GPT-5.4 | Opus 4.6 | Gemini 3.1
GPQA Diamond (science reasoning)         | 92.8%   | 91.3%    | 94.3% *
ARC-AGI-2 (abstract reasoning)           | 73.3%   | 75.2%    | 77.1% *
SWE-Bench Verified (production coding)   | --      | 80.8% *  | 80.6%
SWE-Bench Pro (hard coding tasks)        | 57.7% * | --       | 54.2%
MMMU-Pro (visual reasoning)              | 81.2%   | 85.1% *  | 80.5%
GDPval (44 professions)                  | 83.0% * | 78.0%    | --
OSWorld (desktop automation)             | 75.0% * | 72.7%    | --
BrowseComp (web browsing)                | 82.7%   | 84.0%    | 85.9% *
Terminal-Bench 2.0 (terminal workflows)  | 75.1% * | 65.4%    | 68.5%
MCP Atlas (tool coordination)            | --      | ~59.5%   | 69.2% *

* = category winner. "--" = score not available or not directly comparable. SWE-Bench Verified and SWE-Bench Pro use different test sets. Sources: official model cards from OpenAI, Anthropic, and Google.

[Figure: GPT-5 series benchmark ladder — GPT-5.4 vs GPT-5.3-Codex and GPT-5.2 across GDPval (83%), SWE-Bench Pro (57.7%), OSWorld (75%), Toolathlon (54.6%), and BrowseComp (82.7%). Note: 5.4 is broadly better than 5.3-Codex, not just on coding. Source: OpenAI.]

[Figure: GPT-5 version comparison, 5.1 through 5.4 — context window grew from 400K to 1.05M, and input/output pricing rose from $1.25/$10 to $2.50/$15 per 1M tokens.]

Where each model actually wins

Coding

Claude Opus 4.6 scores 80.8% on SWE-Bench Verified. Gemini 3.1 Pro is right behind at 80.6%. These two are nearly identical on production-level coding tasks.

GPT-5.4 wasn't tested on the same SWE-Bench Verified set, but it scores 57.7% on the harder SWE-Bench Pro benchmark (Gemini gets 54.2% there). The Codex integration gives GPT-5.4 some practical advantages too: 1.5x faster token throughput in /fast mode, and a tool search feature that cuts token usage by 47%.

If coding is your main use case, Opus 4.6 is probably the safest pick. But Sonnet 4.6 at $3/$15 scores almost as well (79.6% SWE-Bench Verified) and costs 40% less on input. Worth considering.

Reasoning

Gemini 3.1 Pro takes this one clearly. 94.3% on GPQA Diamond (graduate-level science questions) and 77.1% on ARC-AGI-2 (abstract pattern recognition). Those are the two hardest reasoning benchmarks we have, and Gemini leads on both.

Here's the part that should make OpenAI uncomfortable: GPT-5.4 Pro scores 94.4% on GPQA Diamond at $30/1M input. Gemini 3.1 Pro hits 94.3% at $2/1M input. Nearly identical reasoning, 15x the cost.

Computer use and desktop automation

GPT-5.4 scores 75% on OSWorld, which measures how well a model can navigate desktop applications. That beats the human baseline of 72.4%. Claude Opus 4.6 is close at 72.7%. Gemini doesn't have native computer use yet.

If you're building agents that need to interact with GUIs, GPT-5.4 and Claude are your only real options right now. GPT-5.4 has a slight edge on the benchmark, but Anthropic has been shipping computer use features for longer, so their tooling is more mature.

Visual reasoning

Opus 4.6 leads on MMMU-Pro at 85.1% vs Gemini's 80.5%. GPT-5.4 scored 81.2% in OpenAI's own release figures, which puts it between the two. If you're processing charts, diagrams, or screenshots, Opus has the edge.

Context windows: bigger isn't always cheaper

Gemini 3.1 Pro has 2M tokens of context. GPT-5.4 has 1.05M. Opus 4.6 has 200K standard (1M in beta). On paper, Gemini wins by a mile. In practice, it's more nuanced.

GPT-5.4's context window has a pricing trap. Go above 272K input tokens and your rate doubles to $5/1M input and jumps to $22.50/1M output. A 500K-token prompt that would cost $1.25 at base rate actually costs $2.50. Gemini has no such threshold at all. 2M tokens, flat $2/1M rate, no gotchas.
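
To make the threshold concrete, here's a small Python sketch of that rule as I've described it. It assumes the entire request is billed at the long-context rate once input crosses 272K tokens, which is how the numbers in this section work out; the real billing mechanics are OpenAI's to define, so treat it as an estimate.

```python
# Assumption: once input exceeds 272K tokens, the whole request is billed at the
# long-context rates ($5 in / $22.50 out per 1M instead of $2.50 / $15).
LONG_CONTEXT_THRESHOLD = 272_000

def gpt54_request_cost(input_tokens: int, output_tokens: int) -> float:
    over = input_tokens > LONG_CONTEXT_THRESHOLD
    in_rate, out_rate = (5.00, 22.50) if over else (2.50, 15.00)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(gpt54_request_cost(250_000, 0))  # 0.625 -- under the threshold, base rate
print(gpt54_request_cost(500_000, 0))  # 2.5   -- the 500K prompt: $2.50, not $1.25
```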

Max output is another thing people overlook. GPT-5.4 can generate 128K tokens in one response. Gemini does 65K. Opus does 32K. If you're generating full documents or long code files, that 4x difference between GPT-5.4 and Opus matters a lot.

Three things people get wrong about this comparison

1. "GPT-5.4 Pro is the best model." It scores 94.4% on GPQA Diamond, yes. But Gemini 3.1 Pro scores 94.3% at 15x less cost. The Pro tier only makes sense if you need extended thinking for multi-step problems where that extra 0.1% compounds. For most tasks, standard GPT-5.4 or Gemini will give you the same answer.

2. "SWE-Bench scores are directly comparable." They're not. Claude and Gemini report SWE-Bench Verified (easier set). GPT-5.4 reports SWE-Bench Pro (harder set). Opus's 80.8% on Verified and GPT-5.4's 57.7% on Pro aren't measuring the same thing. On the same Pro benchmark, GPT-5.4 beats Gemini (57.7% vs 54.2%), but we don't have Opus's Pro score.

3. "Cheaper model = worse model." Gemini 3.1 Pro at $2/1M input leads on 4 out of 10 benchmarks in the table above and is competitive on all the rest. Price stopped being a reliable quality signal in this generation.

What would this actually cost you?

Abstract pricing per million tokens doesn't mean much until you see it at scale. You can estimate your own monthly spend with our cost calculator, but here's how the math looks on three common scenarios:

Scenario                                        | Gemini 3.1 | GPT-5.4 | Opus 4.6
1,000 code reviews (100K tokens in, 8K out each) | $272       | $370    | $700
10,000 chat interactions (10K in, 2K out each)   | $440       | $550    | $550
500 document analyses (500K in, 16K out each)    | $596       | $1,430  | $1,450

The document analysis scenario is where GPT-5.4's 272K pricing threshold kicks in hard. At 500K tokens per prompt, you're paying $5/1M instead of $2.50/1M on input, plus $22.50/1M on output instead of $15. That 2x/1.5x multiplier adds up at scale.
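
Plugging the document-analysis scenario into the gpt54_request_cost sketch from the context-window section reproduces the figure in the table:

```python
# 500 document analyses at 500K tokens in / 16K out each, using the
# gpt54_request_cost sketch from the context-window section above.
per_doc = gpt54_request_cost(500_000, 16_000)  # 0.5M * $5 + 0.016M * $22.50
print(round(per_doc, 2))       # 2.86 per document
print(round(500 * per_doc))    # 1430 -- the $1,430 in the table
```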

Gemini wins on cost in every scenario, sometimes by 2-3x. The question is whether the benchmark differences justify paying more. For coding tasks, I think Opus's premium is defensible. For general reasoning, it's harder to argue against Gemini.

So which one should you pick?

Rather than another "it depends" non-answer, here's how I'd think about it based on what you're building:

GPT-5.4 is the right call when...

You're building desktop automation agents, working across professional domains (finance, law, medicine), or need massive output (128K tokens per response). Also the obvious choice if you're already deep in the OpenAI ecosystem with Codex workflows.

Go with Opus 4.6 when...

Code quality matters more than cost. It has the best coding benchmarks and the strongest visual reasoning (85.1% MMMU-Pro). The writing quality tends to be more nuanced too, though that's harder to benchmark. Sonnet 4.6 at 40% less is worth trying first.

Gemini 3.1 Pro makes sense when...

You care about cost, need the biggest context window (2M tokens, no pricing threshold), or do scientific/abstract reasoning work. It's also the best choice for high-volume batch processing where the per-token savings compound.

What I'd actually do

Gun to my head, one model for everything: Gemini 3.1 Pro. The reasoning is as good as anything else on the market, the context window is the biggest, and it costs less than a fancy coffee per million tokens.

But that's a forced choice. In reality, I'd route coding tasks to Sonnet 4.6 (nearly as good as Opus at 40% less) and everything else to Gemini. If a task specifically needs computer use or desktop automation, GPT-5.4 gets the call.
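
If you want to codify that split, it doesn't need to be clever. A few lines along these lines is enough; the task labels and model strings below are placeholders for whatever your own pipeline already produces.

```python
# Illustrative routing sketch; task labels and model strings are placeholders.
def pick_model(task_type: str) -> str:
    if task_type == "coding":
        return "claude-sonnet-4.6"          # near-Opus coding scores at a lower rate
    if task_type in ("computer_use", "desktop_automation"):
        return "gpt-5.4"                    # native computer use, 75% OSWorld
    return "gemini-3.1-pro"                 # default: cheapest, biggest context

print(pick_model("coding"))               # claude-sonnet-4.6
print(pick_model("desktop_automation"))   # gpt-5.4
print(pick_model("summarization"))        # gemini-3.1-pro
```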

The weird thing about this generation is that the model you pick matters less than how you use it. Prompt caching, smart routing between tiers, and staying under pricing thresholds will save you more money than agonizing over which frontier model scores 2% higher on some benchmark.