Tokenmaxxing can inflate your LLM API bill by 20x. On Gemini 3.1 Pro and GPT-5.4, it's closer to 35x.
The term went viral this week for a practice developers have been doing forever: stuffing prompts with as much context as possible to get better outputs. The cost math is not linear. Two frontier models have hard pricing cliffs where crossing a token threshold retroactively doubles the rate for the entire request, not just the tokens over the line.

Sending 1M tokens per call instead of 50k multiplies your bill by 20x on flat-rate models -- and by 35x on Gemini 3.1 Pro and GPT-5.4, the two models with those retroactive cliffs. There's also a quieter version: Claude Opus 4.7's new tokenizer has been generating 20-35% more tokens for the same code and structured data since April 16, with no change to the rate card.
What tokenmaxxing actually is
The behavior is simple: developers send more context than they need because curating context takes time and more context often seems to help. Dump the entire repo. Include the full conversation history. Send the raw document rather than the relevant sections. The model will figure it out.
This is fine when tokens are cheap and your workloads are small. It gets expensive fast when you are running pipelines at scale or using models with tiered pricing. The term caught on this week because developers comparing their April billing statements started doing the math and posting about it.
What the conversation mostly missed: two of the biggest frontier models do not price large contexts linearly. Here is what it actually costs.
The cost comparison: 10 calls at four context sizes
Total API cost for 10 calls at four input sizes: 50k tokens per call (lean), 200k (near the Gemini cliff), 500k (past both cliffs), and 1M (full tokenmaxxing). Output assumed at 20% of input. Prices as of April 2026.
| Model | 50k / call | 200k / call | 500k / call | 1M / call | 50k→1M |
|---|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $20.00 | $50.00 | $100.00 | 20x |
| Claude Sonnet 4.6 | $3.00 | $12.00 | $30.00 | $60.00 | 20x |
| GPT-5.4 | $2.75 | $11.00 | $47.50 | $95.00 | 35x * |
| Gemini 3.1 Pro | $2.20 | $8.80 | $38.00 | $76.00 | 35x * |
| GPT-5.4 Mini | $0.83 | $3.30 | $8.25 | $16.50 | ~20x |
| DeepSeek V3.2 | $0.18 | $0.73 | $1.82 | $3.64 | ~20x |
* GPT-5.4 (272k threshold) and Gemini 3.1 Pro (200k threshold) retroactively apply the higher rate to the entire request when a single call exceeds the limit.
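The table can be reproduced with a short cost function. A sketch in Python, using only the rates quoted in this article (the Opus output rate of $25/MTok is inferred from the table, not from a published rate card):

```python
def cost_usd(input_tokens, in_rate, out_rate, cliff=None, cliff_rates=None,
             calls=10, output_ratio=0.2):
    """Total cost in USD for `calls` requests. If a single request's input
    crosses `cliff`, the higher `cliff_rates` apply to the whole request."""
    if cliff is not None and input_tokens > cliff:
        in_rate, out_rate = cliff_rates
    output_tokens = input_tokens * output_ratio
    per_call = (input_tokens * in_rate + output_tokens * out_rate) / 1e6
    return round(per_call * calls, 2)

# Rates in USD per million tokens, as quoted in this article
opus   = dict(in_rate=5.00, out_rate=25.00)  # flat rate; output inferred from the table
gpt54  = dict(in_rate=2.50, out_rate=15.00, cliff=272_000, cliff_rates=(5.00, 22.50))
gemini = dict(in_rate=2.00, out_rate=12.00, cliff=200_000, cliff_rates=(4.00, 18.00))

cost_usd(50_000, **opus)       # $5.00, matching the table
cost_usd(500_000, **gpt54)     # $47.50 -- past the 272k cliff
cost_usd(1_000_000, **gemini)  # $76.00 -- past the 200k cliff
```

Swapping in your own per-call sizes is the fastest way to see where your pipeline sits relative to the cliffs.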
The pricing cliff nobody warned you about
Both Gemini 3.1 Pro and GPT-5.4 have a threshold where exceeding it retroactively bills the entire request at a higher rate. Not just the tokens over the limit -- all of them. It works like a tax bracket that does not phase in. Cross the line by one token and your whole request moves to the higher tier.
Gemini 3.1 Pro: the 200k cliff
Under 200k tokens per request: $2.00/MTok input, $12.00/MTok output. Over 200k: $4.00 and $18.00 respectively, for the entire request. Here is what that looks like for a pipeline running 10 calls a day sitting right at the boundary:
| Scenario | Input per call | Daily cost (10 calls) |
|---|---|---|
| Just under the cliff | 199k tokens | $8.76 |
| Just over the cliff | 201k tokens | $15.28 |
Adding 20,000 tokens total across 10 calls -- 2,000 per call -- costs an extra $6.52 a day. Annualized, that is about $2,380 for a codebase that grew slightly or a prompt template that picked up a few more examples. The billing system does not care that 199,999 of those 201,000 tokens were already sitting in the lower bracket.
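The same logic, priced at single-request granularity, shows how sharp the edge is. A sketch using the Gemini rates above, with output again held at 20% of input:

```python
def gemini_request_cost(input_tokens, output_ratio=0.2):
    """Cost in USD for one Gemini 3.1 Pro request at the rates quoted above:
    $2/$12 per MTok in/out under 200k input, $4/$18 over -- for the whole request."""
    over = input_tokens > 200_000
    in_rate, out_rate = (4.00, 18.00) if over else (2.00, 12.00)
    output_tokens = input_tokens * output_ratio
    return (input_tokens * in_rate + output_tokens * out_rate) / 1e6

at_limit = gemini_request_cost(200_000)  # $0.88
one_over = gemini_request_cost(200_001)  # ~$1.52 -- one extra token costs ~$0.64
```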
GPT-5.4: the 272k cliff
OpenAI's rate flips from $2.50 to $5.00/MTok input at 272k tokens per request. Output goes from $15 to $22.50/MTok. Same retroactive logic:
| Scenario | Input per call | Daily cost (10 calls) |
|---|---|---|
| Just under the cliff | 270k tokens | $14.85 |
| Just over the cliff | 280k tokens | $26.60 |
A 79% spike in a day's GPT-5.4 bill is worth checking against your per-request token counts. If anything in your pipeline recently started sending larger requests, that is probably the answer.
Accidental tokenmaxxing: the Opus 4.7 tokenizer change
There is a version of tokenmaxxing that requires no change to what you are sending. Anthropic shipped Opus 4.7 on April 16 with a new tokenizer that splits code and structured data more granularly. The same Python file, the same JSON payload, the same XML schema -- up to 35% more tokens than Opus 4.6 would have counted.
The rate card did not change. It is still $5/MTok input. But for code-heavy pipelines, the bill went up automatically on April 16 with no pricing announcement. The tokenizer change is documented by Anthropic as an improvement to text processing efficiency. The cost side effect is less prominently mentioned.
| Content type | Token multiplier vs Opus 4.6 | Effective cost at $5/MTok |
|---|---|---|
| English prose | ~1.0x | $5.00/MTok |
| Mixed code and text | 1.15-1.25x | $5.75-6.25/MTok |
| Python / JavaScript | 1.2-1.3x | $6.00-6.50/MTok |
| JSON / XML / YAML | up to 1.35x | up to $6.75/MTok |
If you migrated to Opus 4.7 and your API spend went up without any obvious usage increase, run a token count comparison on a representative sample of your inputs. The tokenizer change likely explains most of it.
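A back-of-the-envelope estimate of what the change does to a given pipeline, using the multipliers from the table above (midpoints where the table gives a range; the content mix below is a placeholder -- substitute your own):

```python
# Token multipliers vs Opus 4.6, taken from the table above (midpoints for ranges)
MULTIPLIERS = {"prose": 1.0, "mixed": 1.20, "code": 1.25, "structured": 1.35}

def effective_rate(mix, base_rate=5.00):
    """Blended effective $/MTok after the tokenizer change.
    `mix` maps content type to its share of your input tokens (shares sum to 1.0)."""
    blended = sum(share * MULTIPLIERS[kind] for kind, share in mix.items())
    return base_rate * blended

# Hypothetical pipeline: half code, a fifth JSON, the rest prose
effective_rate({"code": 0.5, "structured": 0.2, "prose": 0.3})  # ~$5.98/MTok vs the nominal $5
```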
Does more context actually help?
This is the assumption that justifies the whole practice, and the research on it is more uncomfortable than most people expect.
A 2023 Stanford study (Liu et al., "Lost in the Middle") found that LLM performance degrades when relevant information is buried in the middle of a long context, even for models designed for long context. A January 2026 follow-up study testing Gemini 2.5 Flash, GPT-5 Mini, Claude Haiku 4.5, and DeepSeek V3.2 on needle-in-haystack tasks found that longer contexts alone do not guarantee better performance and can hurt when relevant evidence is diluted. Some models showed severe degradation under realistic conditions.
We are not saying context never helps -- it clearly does for tasks that genuinely need it. The problem is the untested assumption that bigger is always better. At 20-35x the cost for full-context calls, that assumption is worth actually testing on your specific workload before you commit to it.
How to stop paying for context you do not need
Prompt caching is the easiest starting point. If your system prompt or shared documentation does not change between calls, you should not be paying full input price for it on every request. Anthropic charges $0.50/MTok for cache reads (90% off the $5/MTok rate) after one cache write. OpenAI caches automatically at 90% off. Gemini caches at $0.20/MTok under 200k. One developer reported going from $720/month to $72 just by caching a system prompt that was being re-sent on every call.
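The caching arithmetic is easy to sanity-check. A simplified sketch using the Anthropic numbers above ($5/MTok full price, $0.50/MTok cache reads); it charges one call per day at the full rate as the cache write, which slightly understates the real write surcharge:

```python
def daily_prompt_cost(calls, prompt_tokens, full_rate=5.00, read_rate=0.50):
    """Daily input cost for a shared system prompt, without and with caching.
    Simplification: one full-price cache write, then reads at 90% off."""
    mtok = prompt_tokens / 1e6
    uncached = calls * mtok * full_rate
    cached = mtok * full_rate + (calls - 1) * mtok * read_rate
    return uncached, cached

uncached, cached = daily_prompt_cost(calls=100, prompt_tokens=20_000)
# uncached = $10.00/day, cached = $1.09/day -- roughly the 10x drop in the anecdote
```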
RAG cuts context by retrieving only what is relevant rather than sending everything. For codebases, Chonkie (open source) does tree-sitter-aware chunking that splits files along semantic boundaries rather than arbitrary character counts, which makes the retrieved chunks more useful and the context smaller.
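The retrieval idea in miniature. This is an illustration of the principle, not Chonkie's actual API: split along semantic boundaries, score each chunk against the query, and send only the winners.

```python
import re

def chunk_by_function(source: str) -> list[str]:
    """Naive semantic chunking: split a Python file at top-level `def` boundaries.
    (Tree-sitter-based tools do this properly, across languages.)"""
    chunks, current = [], []
    for line in source.splitlines():
        if line.startswith("def ") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Keyword-overlap scoring stands in for a real embedding search."""
    def words(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    q = words(query)
    return sorted(chunks, key=lambda c: -len(q & words(c)))[:k]
```

Against a large repo, sending only the top few chunks instead of everything is the difference between a lean request and a tokenmaxxed one.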
If you use Gemini 3.1 Pro or GPT-5.4, track your per-request token counts actively. Normal usage patterns can drift across a pricing threshold when a codebase grows or a prompt template picks up new examples. The billing spike will not come with a warning.
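A minimal drift guard -- the thresholds are the ones quoted in this article; wire the return value into whatever logging or alerting your pipeline already has:

```python
# Pricing-cliff thresholds quoted in this article, in input tokens per request
CLIFFS = {"gemini-3.1-pro": 200_000, "gpt-5.4": 272_000}

def cliff_check(model: str, input_tokens: int, warn_at: float = 0.9) -> str:
    """Return 'over', 'near', or 'ok' so you hear about drift before the invoice does."""
    limit = CLIFFS.get(model)
    if limit is None:
        return "ok"                      # no cliff on this model
    if input_tokens > limit:
        return "over"                    # whole request bills at the higher tier
    if input_tokens >= warn_at * limit:
        return "near"                    # within 10% of the cliff by default
    return "ok"
```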
For long agent loops, context checkpointing helps: summarize completed phases and start fresh sessions. This prevents token accumulation across multi-turn conversations and keeps individual requests under pricing cliffs.
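The checkpointing loop, sketched with stubs standing in for the real model and summarizer calls:

```python
def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough chars-per-token heuristic

def run_with_checkpoints(steps, call_model, summarize, budget=150_000):
    """Agent loop that collapses its history into a summary whenever the
    accumulated context would push a request over `budget` tokens."""
    context = []
    for step in steps:
        if sum(estimate_tokens(m) for m in context) > budget:
            context = [summarize(context)]   # checkpoint: summary replaces the history
        reply = call_model(context, step)
        context += [step, reply]
    return context
```

On Gemini 3.1 Pro or GPT-5.4, `budget` doubles as a way to keep every individual request in the lower pricing tier.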
Batching stacks with caching. On Anthropic models, the Batch API cuts 50% off an already-cached read price. A Claude Sonnet 4.6 call with a cached system prompt via Batch API costs $0.15/MTok input -- 95% off the standard $3 rate.
Sources
- Anthropic API pricing (Claude Opus 4.7, Sonnet 4.6) -- anthropic.com
- OpenAI API pricing (GPT-5.4, GPT-5.4 Mini) -- openai.com
- Google AI pricing (Gemini 3.1 Pro) -- ai.google.dev
- DeepSeek API pricing -- platform.deepseek.com
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" -- arXiv:2307.03172 (2023)
- Anthropic, Claude Opus 4.7 release -- anthropic.com