GPT-5.5 has a pricing cliff at 272K tokens. Cross it and the whole request bills double.
GPT-5.5 lists at $5 input and $30 output per million tokens, the rate most people quote. That rate holds only up to 272K input tokens in a single request. Push one token past the line and OpenAI reprices the entire session at $10 and $45. Here is where the cliff sits, what it costs in dollars, and the workloads that walk you off the edge without warning.

The number you should remember is 272,000. Below it, GPT-5.5 runs at $5 input and $30 output per million tokens. Above it, the same request jumps to $10 input and $45 output, and the higher rate covers the full session rather than only the tokens past the mark.
The gap between a 270K-token request and a 275K-token one is not five thousand tokens of cost. It is roughly double the whole bill. If you feed GPT-5.5 large codebases, long documents, or fat retrieval contexts, this is the single line item most likely to surprise you.
What the cliff actually is
OpenAI states the rule plainly in its developer docs: for GPT-5.5, prompts with more than 272K input tokens are priced at 2x input and 1.5x output for the full session, across the standard, batch, and flex tiers. Applied to the base $5 and $30 rates, that lands at $10 input and $45 output per million. Independent trackers like OpenRouter surface the same two-tier split, which is a good sanity check when a rate looks too low.
Read the phrase "for the full session" twice, because it is the part that bites. This is not a tiered rate where the first 272K tokens bill at one price and the rest at another. Cross the threshold and every token in the request, input and output, charges at the elevated rate. A prompt at 271K input tokens sits in the cheap tier. The same prompt with 2,000 more tokens of context attached pays nearly twice as much, on all of it.
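The rule above fits in a few lines of code. This is a minimal sketch, not OpenAI's billing logic: the function name `request_cost` is made up for illustration, the rates are the standard-tier figures quoted in this article, and "272K" is assumed to mean exactly 272,000 input tokens.

```python
THRESHOLD = 272_000  # input tokens; assumes "272K" means exactly 272,000


def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = 5.00, output_rate: float = 30.00) -> float:
    """Dollar cost of one request. Rates are dollars per million tokens."""
    if input_tokens > THRESHOLD:
        # The whole request is repriced, not only the tokens past the mark.
        input_rate *= 2.0
        output_rate *= 1.5
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


below = request_cost(271_000, 10_000)
above = request_cost(273_000, 10_000)
print(f"271K input: ${below:.3f}")
print(f"273K input: ${above:.3f}")
print(f"ratio: {above / below:.2f}x")
```

The only branch is on the input count. Output tokens never trigger the higher tier, but they are billed at it once the input crosses.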
One more thing worth clearing up: there is no separate model called GPT-5.5 Short Context. You call a single model ID, gpt-5.5, and the short and long labels are just the two pricing columns. The model decides nothing. Your token count does.
The full rate card, under 272K
Start with the prices everyone sees. These are the per-million rates while a request stays under the threshold. Batch and flex shave 50% off standard, priority adds a premium for faster service, and Pro is a different beast that runs parallel reasoning passes.
| Serving tier | Input /1M | Cached /1M | Output /1M |
|---|---|---|---|
| Standard | $5.00 | $0.50 | $30.00 |
| Batch | $2.50 | $0.25 | $15.00 |
| Flex | $2.50 | $0.25 | $15.00 |
| Priority | $12.50 | $1.25 | $75.00 |
| GPT-5.5 Pro | $30.00 | n/a | $180.00 |
GPT-5.5 carries a 1.05M-token context window with up to 128K output. So the long-context tier is not a corner case you have to engineer toward. With the threshold at 272K, it covers roughly the upper three-quarters of the window you paid to access.
What each tier costs once you cross over
Here is the same rate card with the long-context column filled in. The 2x input and 1.5x output multiplier rides on top of whichever serving tier you picked. Notice that priority and Pro have no published long-context rate, so OpenAI does not document a surcharge there.
| Tier | Input ≤272K | Input >272K | Output ≤272K | Output >272K |
|---|---|---|---|---|
| Standard | $5.00 | $10.00 | $30.00 | $45.00 |
| Batch | $2.50 | $5.00 | $15.00 | $22.50 |
| Flex | $2.50 | $5.00 | $15.00 | $22.50 |
| Priority | $12.50 | no long tier | $75.00 | no long tier |
| Pro | $30.00 | no long tier | $180.00 | no long tier |
Batch over the threshold ($5 input, $22.50 output) matches standard under it on input and undercuts it on output, so if your workload tolerates the batch turnaround, that is the cheapest way to run genuinely long prompts.
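The long-context column can be derived rather than memorized. The sketch below applies the 2x input and 1.5x output multipliers to the base rates from the table. The names `BASE_RATES` and `long_context_rates` are illustrative, and priority and Pro are left out because no long-context rate is published for them.

```python
# (input, output) in dollars per million tokens, at or under 272K input.
BASE_RATES = {
    "standard": (5.00, 30.00),
    "batch": (2.50, 15.00),
    "flex": (2.50, 15.00),
}
INPUT_MULT, OUTPUT_MULT = 2.0, 1.5


def long_context_rates(tier: str) -> tuple[float, float]:
    """Return the (input, output) rates for a tier once input exceeds 272K."""
    input_rate, output_rate = BASE_RATES[tier]
    return input_rate * INPUT_MULT, output_rate * OUTPUT_MULT


for tier in BASE_RATES:
    input_rate, output_rate = long_context_rates(tier)
    print(f"{tier:<9} ${input_rate:.2f} in / ${output_rate:.2f} out")
```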
The cliff in dollars
Abstract multipliers hide the shock. Here is what a single standard-tier request costs at a range of sizes. The 270K and 275K rows straddle the threshold, and that pair is the whole story.
| Input tokens | Output tokens | Tier | Cost |
|---|---|---|---|
| 80K | 6K | Short | $0.58 |
| 200K | 10K | Short | $1.30 |
| 270K | 15K | Short | $1.80 |
| 275K | 15K | Long | $3.43 |
| 500K | 20K | Long | $5.90 |
| 900K | 40K | Long | $10.80 |
A 270K-token request with a 15K answer runs $1.80. Add 5,000 tokens of context, which is maybe two more files or a few extra retrieved passages, and the same call costs $3.43. You sent about 1.9% more input and got a 90% bigger bill. Nothing in the API response flags this. The token count slides past 272K and the meter quietly switches rate cards.
Near the top of the window, the numbers stop being subtle. A 900K-token input with a 40K answer is $10.80 a call. Run that as an automated job a few thousand times a month and the long-context premium alone, the delta over what the standard rate would have charged, runs into real money.
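To put a number on that premium, the sketch below prices the 900K-input, 40K-output call twice: once at the long-context rates it actually pays, and once at the base rates it would pay if no surcharge existed. The figure of 3,000 calls a month is a hypothetical stand-in for "a few thousand".

```python
def cost(input_tokens, output_tokens, input_rate, output_rate):
    """Dollar cost of one request. Rates are dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


input_tokens, output_tokens = 900_000, 40_000

long_cost = cost(input_tokens, output_tokens, 10.00, 45.00)  # what the call pays
base_cost = cost(input_tokens, output_tokens, 5.00, 30.00)   # hypothetical, no surcharge
premium = long_cost - base_cost

calls_per_month = 3_000  # hypothetical volume
print(f"per call: ${long_cost:.2f} billed, ${premium:.2f} of it is premium")
print(f"per month: ${premium * calls_per_month:,.2f} in premium")
```

Under these assumptions, a little under half of each call's cost is surcharge.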
When you hit 272K without meaning to
272K input tokens sounds like a lot until you count what modern workloads actually push through a prompt. A few of the common ways past it:
- Whole-repo coding agents. A medium codebase dumped into context with file trees, dependencies, and prior turns clears 272K fast. Agent loops that re-send growing history make every later step a long-context call.
- RAG with generous retrieval. Pull 40 chunks at 2K tokens each and you are already at 80K before the system prompt and the question. Crank top-k up for recall and you drift over the line on the long-tail queries.
- Document and transcript analysis. A 300-page contract, a deposition, or a quarter's worth of support transcripts can each blow past 272K on their own. These are exactly the jobs people reach for a 1M window to do.
- Long agent sessions. Conversations that accumulate tool outputs cross the threshold partway through, so the first half of the chat bills cheap and everything after it bills at the premium.
The pattern is consistent: the cliff lands precisely on the use cases that justify a giant context window in the first place. The bigger the window you exploit, the more of your traffic pays the surcharge.
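Two of those patterns are easy to model ahead of time. The sketch below estimates how many retrieved chunks fit under the line, and at which step a growing agent history first crosses it. Every number passed in is hypothetical, as are the helper names `max_chunks` and `first_long_step`. Real prompts grow unevenly, so treat this as a rough planning aid.

```python
THRESHOLD = 272_000  # input tokens; assumes "272K" means exactly 272,000


def max_chunks(overhead_tokens: int, chunk_tokens: int) -> int:
    """Largest number of retrieved chunks that keeps the prompt at or under the threshold."""
    return max(0, (THRESHOLD - overhead_tokens) // chunk_tokens)


def first_long_step(initial_tokens: int, added_per_step: int, max_steps: int):
    """1-based step whose prompt first exceeds the threshold, or None if none does."""
    prompt = initial_tokens
    for step in range(1, max_steps + 1):
        if prompt > THRESHOLD:
            return step
        prompt += added_per_step  # history re-sent on the next step
    return None


# Hypothetical RAG setup: 12K of system prompt and question, 2K-token chunks.
print(max_chunks(overhead_tokens=12_000, chunk_tokens=2_000))

# Hypothetical agent: starts at 150K and gains 20K of tool output per step.
print(first_long_step(initial_tokens=150_000, added_per_step=20_000, max_steps=20))
```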
Staying on the cheap side of the line
The surcharge is avoidable more often than people assume, because most prompts that cross 272K are padded rather than genuinely dense. A few levers:
- Split the work. Two 200K-token requests both bill at the standard rate. One 400K request bills the whole thing at the premium. When a task can be mapped over chunks, splitting is almost always cheaper than a single monster call.
- Trim retrieval. Reranking and tighter top-k cut the input that pushes you over without hurting answer quality much. The marginal chunk past 272K is the most expensive token you will send all day.
- Cache the fixed prefix. Cached input is $0.50 per million under the threshold, a tenth of fresh input. It does not change where the cliff sits, but it makes the input you must resend far cheaper.
- Batch the genuinely long jobs. If a prompt truly needs 500K of context, run it through batch. Over the threshold, batch is $5 input and $22.50 output, half the standard long-context rate.
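The first and last levers can be compared directly. This sketch prices a 400K-token job three ways: one standard call, two 200K standard calls, and one batch call. It assumes 20K output tokens in total, an even split of the output across the two halves, and no shared prefix that must be re-sent in both. All three are simplifications, and `request_cost` is an illustrative helper.

```python
THRESHOLD = 272_000  # input tokens; assumes "272K" means exactly 272,000

# (input, output) in dollars per million tokens, at or under the threshold.
BASE_RATES = {"standard": (5.00, 30.00), "batch": (2.50, 15.00)}


def request_cost(input_tokens, output_tokens, tier="standard"):
    """Dollar cost of one request, repricing the whole request past the threshold."""
    input_rate, output_rate = BASE_RATES[tier]
    if input_tokens > THRESHOLD:
        input_rate *= 2.0
        output_rate *= 1.5
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


single = request_cost(400_000, 20_000)
split = 2 * request_cost(200_000, 10_000)
batch = request_cost(400_000, 20_000, tier="batch")

print(f"one 400K standard call:   ${single:.2f}")
print(f"two 200K standard calls:  ${split:.2f}")
print(f"one 400K batch call:      ${batch:.2f}")
```

With these particular numbers, splitting nearly halves the bill and batch lands slightly below the split. Which one wins in practice depends on how much context the halves would have to duplicate.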
Whatever you do, count the tokens before you ship the prompt. You can model the exact bill for any input and output size on our cost calculator, and the standard GPT-5.5 rates sit on the pricing page next to every other model.
OpenAI is zigging while Anthropic zagged
This is not a GPT-5.5 quirk. GPT-5.4 carried the same 272K rule, and so did the models before it. What changed is the base rate. GPT-5.4 sat at $2.50 input and $15 output, so its long-context tier topped out around $5 and $22.50. GPT-5.5 doubled the base to $5 and $30, which means the cliff above it doubled too. Same mechanic, bigger drop.
Anthropic went the other way. In March it scrapped the long-context surcharge it used to charge above 200K tokens, so Claude now bills one flat rate no matter how full the window gets. Opus 4.8 is $5 input and $25 output whether you send 10K tokens or well past the old 200K mark. That is a real difference in posture: Anthropic decided long context should not carry a penalty, and OpenAI decided it should, on the entire request.
For a workload that lives near the top of the context window, that gap can flip which model is cheaper per task, even when the headline rates look close. The sticker price tells you what a short prompt costs. It does not tell you what your prompt costs.
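A rough way to see the flip is to price the same token counts under both schemes. The sketch below uses the GPT-5.5 standard rates and the flat $5 and $25 figures quoted above for Opus 4.8. It assumes both models see identical token counts, which is a simplification because different tokenizers count the same text differently. The function names are illustrative.

```python
THRESHOLD = 272_000  # input tokens; assumes "272K" means exactly 272,000


def gpt55_cost(input_tokens, output_tokens):
    """GPT-5.5 standard tier, with whole-request repricing past the threshold."""
    input_rate, output_rate = 5.00, 30.00
    if input_tokens > THRESHOLD:
        input_rate, output_rate = 10.00, 45.00
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def flat_cost(input_tokens, output_tokens, input_rate=5.00, output_rate=25.00):
    """Flat per-token pricing with no long-context surcharge."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


output_tokens = 15_000
for input_tokens in (270_000, 275_000, 900_000):
    a = gpt55_cost(input_tokens, output_tokens)
    b = flat_cost(input_tokens, output_tokens)
    print(f"{input_tokens:>7,} in: GPT-5.5 ${a:.3f} vs flat ${b:.3f}")
```

Under these assumptions, the two are within a dime of each other at 270K input and close to 2x apart at 275K.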
Price your own prompt before you send it
Drop your real input and output token counts into the calculator and see whether a request lands above or below the 272K line, then compare GPT-5.5 against Claude and Gemini for the same job. The whole point of TokenCost is catching surprises like this before the invoice does.
Questions
What is the GPT-5.5 272K pricing cliff?
Once a single request crosses 272,000 input tokens, OpenAI reprices the entire session at 2x input and 1.5x output. The base $5 input and $30 output per million become $10 and $45. It applies to the standard, batch, and flex tiers.
Does the higher rate apply only to the tokens above 272K?
No. The elevated rate covers the whole request, input and output, the moment you cross the line. A prompt at 273K input tokens charges every token at the higher rate, not just the 1,000 over the threshold. That is what makes it a cliff and not a tier.
Is GPT-5.5 Short Context a separate model?
No. There is one model ID, gpt-5.5. Short context and long context are pricing columns keyed off the 272K input-token threshold, not different models. You call the same model and the rate depends on how many tokens the request carries.
How do I avoid the long-context surcharge?
Keep each request under 272K input tokens. Split large documents into separate calls, trim retrieved context before sending, and cache the fixed parts of your prompt. If a prompt genuinely needs more than 272K, run it through batch, where the over-threshold rate is half the standard one.