TokenCost
Guide · April 18, 2026 · 7 min read

Tokenmaxxing can inflate your LLM API bill by 20x. On Gemini and GPT-5.4, it's worse.

The term went viral this week for a practice developers have been doing forever: stuffing prompts with as much context as possible to get better outputs. The cost math is not linear. Two frontier models have hard pricing cliffs where crossing a token threshold retroactively doubles the rate for the entire request, not just the tokens over the line.


Photo by Joao Vitor Duarte on Unsplash

Sending 1M tokens per call instead of 50k multiplies your bill by 20x on flat-rate models -- and by 35x on Gemini 3.1 Pro and GPT-5.4, which both have pricing cliffs that retroactively double the rate for the entire request once you cross their threshold. There's also a quieter version: Claude Opus 4.7's new tokenizer has been generating 20-35% more tokens for the same code since April 16, with no change to the rate card.

What tokenmaxxing actually is

The behavior is simple: developers send more context than they need because curating context takes time and more context often seems to help. Dump the entire repo. Include the full conversation history. Send the raw document rather than the relevant sections. The model will figure it out.

This is fine when tokens are cheap and your workloads are small. It gets expensive fast when you are running pipelines at scale or using models with tiered pricing. The term went viral this week because developers comparing their April billing statements started doing the math and posting about it.

What the conversation mostly missed: two of the biggest frontier models do not price large contexts linearly. Here is what it actually costs.

The cost comparison: 10 calls at four context sizes

Total API cost for 10 calls at four input sizes: 50k tokens per call (lean), 200k (near the Gemini cliff), 500k (past both cliffs), and 1M (full tokenmaxxing). Output assumed at 20% of input. Prices as of April 2026.

| Model | 50k / call | 200k / call | 500k / call | 1M / call | 50k→1M |
|---|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $20.00 | $50.00 | $100.00 | 20x |
| Claude Sonnet 4.6 | $3.00 | $12.00 | $30.00 | $60.00 | 20x |
| GPT-5.4 | $2.75 | $11.00 | $47.50 | $95.00 | 35x * |
| Gemini 3.1 Pro | $2.20 | $8.80 | $38.00 | $76.00 | 35x * |
| GPT-5.4 Mini | $0.83 | $3.30 | $8.25 | $16.50 | ~20x |
| DeepSeek V3.2 | $0.18 | $0.73 | $1.82 | $3.64 | ~20x |

* GPT-5.4 (272k threshold) and Gemini 3.1 Pro (200k threshold) retroactively apply the higher rate to the entire request when a single call exceeds the limit.

The pricing cliff nobody warned you about

Both Gemini 3.1 Pro and GPT-5.4 have a threshold where exceeding it retroactively bills the entire request at a higher rate. Not just the tokens over the limit -- all of them. It works like a tax bracket that does not phase in. Cross the line by one token and your whole request moves to the higher tier.

Gemini 3.1 Pro: the 200k cliff

Under 200k tokens per request: $2.00/MTok input, $12.00/MTok output. Over 200k: $4.00 and $18.00 respectively, for the entire request. Here is what that looks like for a pipeline running 10 calls a day sitting right at the boundary:

| Scenario | Input per call | Daily cost (10 calls) |
|---|---|---|
| Just under the cliff | 199k tokens | $8.76 |
| Just over the cliff | 201k tokens | $15.24 |

Adding 20,000 tokens total across 10 calls -- 2,000 per call -- costs an extra $6.48 a day. Annualized, that is about $2,360 for a codebase that grew slightly or a prompt template that picked up a few more examples. The billing system does not care that 199,999 of those 201,000 tokens were already sitting in the lower bracket.
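The retroactive tier logic is easy to model. Below is a minimal sketch of a cost function for this kind of cliff pricing, using the Gemini 3.1 Pro rates quoted above; the function name is ours, and output is assumed at 20% of input as in the comparison table.

```python
def request_cost(input_tokens, output_tokens,
                 in_low, out_low, in_high, out_high, threshold):
    """Dollar cost of one request under retroactive tier pricing.

    If input_tokens exceeds `threshold`, the ENTIRE request is billed
    at the high-tier rates -- not just the tokens over the line.
    Rates are dollars per million tokens.
    """
    if input_tokens > threshold:
        in_rate, out_rate = in_high, out_high
    else:
        in_rate, out_rate = in_low, out_low
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Gemini 3.1 Pro rates from the article: $2/$12 under 200k, $4/$18 over.
# Output assumed at 20% of input.
just_under = request_cost(199_000, 39_800, 2, 12, 4, 18, 200_000)
just_over = request_cost(201_000, 40_200, 2, 12, 4, 18, 200_000)
```

Ten calls just under the cliff come to about $8.76 a day; crossing it by 2k tokens per call pushes every token into the higher bracket.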

GPT-5.4: the 272k cliff

OpenAI's rate flips from $2.50 to $5.00/MTok input at 272k tokens per request. Output goes from $15 to $22.50/MTok. Same retroactive logic:

| Scenario | Input per call | Daily cost (10 calls) |
|---|---|---|
| Just under the cliff | 270k tokens | $14.85 |
| Just over the cliff | 280k tokens | $26.60 |

A 79% spike in a day's GPT-5.4 bill is worth checking against your per-request token counts. If anything in your pipeline recently started sending larger requests, that is probably the answer.
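The GPT-5.4 figures follow directly from the rates above. A quick sketch to check them (function name is ours; output again assumed at 20% of input):

```python
def gpt54_daily_cost(input_tokens_per_call, calls=10):
    """Daily cost for GPT-5.4 under the article's rates: $2.50/$15 per
    MTok below the 272k threshold, $5/$22.50 above, applied
    retroactively to the whole request. Output assumed at 20% of input."""
    output_tokens = int(input_tokens_per_call * 0.2)
    if input_tokens_per_call > 272_000:
        in_rate, out_rate = 5.00, 22.50
    else:
        in_rate, out_rate = 2.50, 15.00
    per_call = (input_tokens_per_call * in_rate
                + output_tokens * out_rate) / 1_000_000
    return calls * per_call

under = gpt54_daily_cost(270_000)  # 14.85
over = gpt54_daily_cost(280_000)   # 26.60
```

That is the 79% spike: $14.85 to $26.60 for 10k more input tokens per call.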

Accidental tokenmaxxing: the Opus 4.7 tokenizer change

There is a version of tokenmaxxing that requires no change to what you are sending. Anthropic shipped Opus 4.7 on April 16 with a new tokenizer that splits code and structured data more granularly. The same Python file, the same JSON payload, the same XML schema -- up to 35% more tokens than Opus 4.6 would have counted.

The rate card did not change. It is still $5/MTok input. But for code-heavy pipelines, the bill went up automatically on April 16 with no pricing announcement. The tokenizer change is documented by Anthropic as an improvement to text processing efficiency. The cost side effect is less prominently mentioned.

| Content type | Token multiplier vs Opus 4.6 | Effective cost at $5/MTok |
|---|---|---|
| English prose | ~1.0x | $5.00/MTok |
| Mixed code and text | 1.15-1.25x | $5.75-6.25/MTok |
| Python / JavaScript | 1.2-1.3x | $6.00-6.50/MTok |
| JSON / XML / YAML | up to 1.35x | up to $6.75/MTok |

If you migrated to Opus 4.7 and your API spend went up without any obvious usage increase, run a token count comparison on a representative sample of your inputs. The tokenizer change explains most of it.
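To estimate the impact on a whole bill, weight the per-content-type multipliers from the table by your workload's mix. A sketch with an entirely illustrative mix (the shares below are made up; the multipliers are the article's):

```python
# Hypothetical token mix for a code-heavy pipeline: share of input
# tokens by content type, paired with the article's multipliers.
mix = {
    "prose": (0.20, 1.00),
    "mixed code and text": (0.30, 1.20),
    "python/js": (0.35, 1.25),
    "json/xml/yaml": (0.15, 1.35),
}

def bill_multiplier(mix):
    """Blended token multiplier for a workload: the weighted average of
    per-content-type multipliers. The bill scales by the same factor,
    since the $5/MTok rate is unchanged."""
    return sum(share * mult for share, mult in mix.values())

blended = bill_multiplier(mix)  # ~1.20 for this mix
```

A pipeline with this mix would see roughly a 20% spend increase from the tokenizer change alone, with no usage growth.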

Does more context actually help?

This is the assumption that justifies the whole practice, and the research on it is more uncomfortable than most people expect.

A 2023 Stanford study (Liu et al., "Lost in the Middle") found that LLM performance degrades when relevant information is buried in the middle of a long context, even for models designed for long context. A January 2026 follow-up study testing Gemini 2.5 Flash, GPT-5 Mini, Claude Haiku 4.5, and DeepSeek V3.2 on needle-in-haystack tasks found that longer contexts alone do not guarantee better performance and can hurt when relevant evidence is diluted. Some models showed severe degradation under realistic conditions.

We are not saying context never helps -- it clearly does for tasks that genuinely need it. The problem is the untested assumption that bigger is always better. At 20-35x the cost, that assumption is worth actually testing on your specific workload before you commit to it.

How to stop paying for context you do not need

Prompt caching is the easiest starting point. If your system prompt or shared documentation does not change between calls, you should not be paying full input price for it on every request. Anthropic charges $0.50/MTok for cache reads (90% off the $5/MTok rate) after one cache write. OpenAI caches automatically at 90% off. Gemini caches at $0.20/MTok under 200k. One developer reported going from $720/month to $72 just by caching a system prompt that was being re-sent on every call.
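The arithmetic behind that anecdote is just the read discount applied to the cached portion. A minimal sketch, ignoring the one-time cache-write premium (negligible at this volume):

```python
def cached_prompt_cost(monthly_uncached_cost, cache_discount=0.90):
    """Rough monthly cost when a static system prompt is served from
    cache: reads are billed at (1 - discount) of the full input rate.
    Assumes the prompt is identical on every call, so every read after
    the first write is a cache hit."""
    return monthly_uncached_cost * (1 - cache_discount)

cached_prompt_cost(720)  # ~72 -- the $720/month -> $72 anecdote
```

The 90% figure applies to the cached prefix only; tokens that vary per request are still billed at the full rate.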

RAG cuts context by retrieving only what is relevant rather than sending everything. For codebases, Chonkie (open source) does tree-sitter-aware chunking that splits files along semantic boundaries rather than arbitrary character counts, which makes the retrieved chunks more useful and the context smaller.

If you use Gemini 3.1 Pro or GPT-5.4, track your per-request token counts actively. Normal usage patterns can drift across a pricing threshold when a codebase grows or a prompt template picks up new examples. The billing spike will not come with a warning.

For long agent loops, context checkpointing helps: summarize completed phases and start fresh sessions. This prevents token accumulation across multi-turn conversations and keeps individual requests under pricing cliffs.
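The loop structure is simple to sketch. Everything below is illustrative: `count_tokens` and `summarize` are stand-ins for a real tokenizer call and a real summarization request, not any particular SDK.

```python
def run_with_checkpoints(turns, budget_tokens, count_tokens, summarize):
    """Illustrative context-checkpointing loop. When the running
    context would cross `budget_tokens` (e.g. a pricing cliff), the
    completed turns are collapsed into a summary and the session
    effectively restarts with that summary plus the current turn."""
    context = []
    for turn in turns:
        context.append(turn)
        if sum(count_tokens(t) for t in context) > budget_tokens:
            context = [summarize(context), turn]
    return context
```

In a real agent, `summarize` would be a cheap model call over the completed phase, and the budget would sit comfortably below the provider's cliff threshold.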

Batching stacks with caching. On Anthropic models, the Batch API cuts 50% off an already-cached read price. A Claude Sonnet 4.6 call with a cached system prompt via Batch API costs $0.15/MTok input -- 95% off the standard $3 rate.
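The two discounts multiply rather than add. A quick check of the stacking math under the rates quoted in this article, assuming the whole input is a cache hit:

```python
def stacked_rate(base_rate, cache_discount=0.90, batch_discount=0.50):
    """Effective input $/MTok when prompt caching and the Batch API
    stack: cache reads are billed at 10% of the base rate, and batch
    halves that again. Assumes a full cache hit on the input."""
    return base_rate * (1 - cache_discount) * (1 - batch_discount)

stacked_rate(3.00)  # Sonnet 4.6: ~$0.15/MTok, 95% off the $3 rate
```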
