GPT-5.4 computer use: what a real agent task actually costs
Computer use sessions are input-heavy by design. Screenshots compound with conversation history on every turn, GPT-5.4 has a pricing cliff at 272K tokens that retroactively doubles your input bill, and the $2.50/M baseline rate tells you less than you think. Here is the cost breakdown for real tasks.

GPT-5.4 charges $2.50 per million input tokens for computer use -- no premium over the standard API. The catch is that computer use sessions are input-heavy by nature: each turn carries the full conversation history plus a new screenshot. A 5-turn web lookup costs about $0.03. A 15-turn data extraction task runs around $0.32. And if your session crosses GPT-5.4's 272K input threshold -- which a long agentic workflow can do around turn 45-50 -- the input rate doubles to $5.00/M retroactively.
Why screenshots cost more than they look
Computer use works in a loop. You send a task, the model returns an action (click, type, scroll), your code executes it, takes a screenshot, and sends it back. Repeat until done. The screenshots are what drive the cost -- but not for the reason you might expect.
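A minimal sketch of that loop in Python, assuming the Responses-style API described in OpenAI's computer use guide. Here `extract_computer_call`, `execute_action`, and `take_screenshot` are hypothetical stand-ins for your own automation layer (Playwright, a VM driver, etc.), and the exact field names should be treated as illustrative rather than authoritative:

```python
# One iteration per model turn: the full history (screenshots included)
# is resent on every call, which is why context compounds.
# extract_computer_call(), execute_action(), and take_screenshot() are
# hypothetical stand-ins for your own automation layer.
import base64

def run_task(client, task: str, max_turns: int = 30):
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        response = client.responses.create(
            model="gpt-5.4",
            input=history,
            reasoning={"effort": "medium"},   # see the effort notes below
        )
        history += response.output            # model actions stay in context
        action = extract_computer_call(response)
        if action is None:                    # no action item: task is done
            return response
        execute_action(action)                # click / type / scroll locally
        screenshot = take_screenshot(width=1024, height=768)
        history.append({
            "type": "computer_call_output",
            "call_id": action.call_id,
            "output": {
                "type": "input_image",
                "image_url": "data:image/png;base64,"
                             + base64.b64encode(screenshot).decode(),
            },
        })
    return None
```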
At 1280x720, each screenshot runs roughly 1,200 tokens in GPT-5.4's 32-pixel patch system. That's about $0.003 per frame at $2.50/M. Cheap. The problem is that the full conversation history stays in context for every subsequent turn. By turn 10, you're sending around 15,000 tokens of context per call. By turn 20, it's closer to 35,000. The screenshots are just the part that keeps compounding.
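To put numbers on the compounding, here is a quick estimator using the per-frame figure at the recommended 1024x768 resolution. The flat per-turn overhead is an assumption; real late turns run heavier than this as extracted page text accumulates:

```python
# Rough per-call context size as turns accumulate. A flat overhead
# understates late turns, since extracted page text grows with the session.
SCREENSHOT_TOKENS = 1_050   # ~1024x768 frame, per the article
TURN_OVERHEAD = 450         # assumed: actions, instructions, tool results
INPUT_RATE = 2.50           # $/M input, under the 272K threshold

for turn in (5, 10, 20, 30):
    context = turn * (SCREENSHOT_TOKENS + TURN_OVERHEAD)
    cost = context * INPUT_RATE / 1e6
    print(f"turn {turn:>2}: ~{context:>6,} context tokens (~${cost:.3f} that call)")
```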
OpenAI's computer use guide explicitly recommends 1024x768, which runs about 1,050 tokens per screenshot. Switching to 1920x1080 -- the obvious default for a desktop environment -- costs 4-5x more tokens per frame with no improvement in click accuracy for most UI tasks. That single setting can take a meaningful bite out of the bill on anything longer than a handful of turns.
Reasoning effort compounds things further. GPT-5.4 supports five levels: none, low, medium, high, and xhigh. At high or xhigh, the model generates substantially more reasoning tokens on top of the base cost -- roughly 3-5x more per call. Most computer use tasks run fine at medium. High is worth testing on tasks with ambiguous UI or multi-step planning, but it should be a deliberate choice.
What typical tasks actually cost
These estimates use 1024x768 resolution and medium reasoning effort. Token counts are derived from OpenAI's computer use documentation and third-party session analysis. Real costs vary with page content length, tool call frequency, and how much text the model extracts from the UI. When we ran a 20-turn research session against a multi-page procurement portal, total input came in around 130K tokens -- well under the 272K threshold, but roughly 4x higher than we initially expected before accounting for context accumulation.
| Task | Turns | Approx. tokens | Cost (GPT-5.4) |
|---|---|---|---|
| Web lookup (navigate + read) | 5 | ~5K in / ~1K out | ~$0.03 |
| Form fill (3-5 fields) | 8 | ~20K in / ~3K out | ~$0.09 |
| Data extraction (scrape + format) | 15 | ~80K in / ~8K out | ~$0.32 |
| Research workflow (multi-page) | 30 | ~200K in / ~15K out | ~$0.73 |
| Enterprise automation (50+ steps) | 50+ | >300K in (cliff risk) | $1.50-5+ (see below) |
Estimates at 1024x768 and medium reasoning. Source: OpenAI computer use guide
The 272K cliff
GPT-5.4 has a 1,050,000-token context window. Most sessions never come close. But there's a pricing threshold at 272K input tokens that matters more than the total context size: cross it, and your input rate doubles from $2.50 to $5.00 per million. Output jumps from $15.00 to $22.50. Both rates apply retroactively to the entire session -- not just to the tokens over the line.
The math is abrupt. A 300K input / 20K output session costs $1.95 at long-context rates. Trim it to 250K input and 20K output -- 50K tokens less -- and the same session costs $0.93. You either pay $2.50/M or you pay $5.00/M. There's no gradual slope.
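The step function is simple to encode. A minimal sketch using the rates above (treating the boundary as exclusive at exactly 272K is an assumption):

```python
# Session cost under GPT-5.4's two-tier pricing. Crossing 272K input
# tokens retroactively reprices the whole session, hence the hard step.
CLIFF = 272_000

def session_cost(input_tokens: int, output_tokens: int) -> float:
    if input_tokens > CLIFF:
        in_rate, out_rate = 5.00, 22.50   # long-context rates, $/M
    else:
        in_rate, out_rate = 2.50, 15.00   # standard rates, $/M
    return (input_tokens * in_rate + output_tokens * out_rate) / 1e6

print(session_cost(300_000, 20_000))  # 1.95  -- over the cliff
print(session_cost(250_000, 20_000))  # 0.925 -- same work, under it
```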
At 1024x768 with typical context buildup, sessions generate roughly 5,000-7,000 input tokens per turn. That puts the 272K threshold around turn 40-50 for most tasks. Short workflows are well clear of it. A research agent that reads several full pages, processes their content, and builds a structured output can get there.
The practical fix is context checkpointing. At natural task milestones -- after completing a phase, after extracting a data batch -- summarize what you have, discard the old screenshots, and start a fresh session for the next phase. Each checkpoint resets token accumulation. It adds some pipeline complexity but keeps sessions in the cheap tier and makes debugging easier as a side effect.
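A sketch of that pattern, reusing the `run_task` loop from earlier; `summarize_progress` is a hypothetical helper that distills a finished phase into plain text:

```python
# Context checkpointing: each phase runs as a fresh session seeded with a
# compact text summary instead of the accumulated screenshot history.
def run_phased_task(client, phases: list[str]) -> str:
    carry = ""                                  # plain-text summary, no images
    for phase in phases:
        prompt = (f"Context from earlier phases:\n{carry}\n\n"
                  f"Task for this phase: {phase}")
        result = run_task(client, prompt)       # fresh session: tokens reset
        carry = summarize_progress(result)      # hypothetical: distill to text
    return carry
```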
GPT-5.4 vs Claude vs Gemini for computer use
All three major providers offer computer use APIs. The pricing differences are real and the tradeoffs go beyond cost per token.
| Model | Input / 1M | Output / 1M | OSWorld | Notes |
|---|---|---|---|---|
| GPT-5.4 | $2.50 | $15.00 | 75.0% | Doubles to $5/$22.50 over 272K input |
| GPT-5.4-mini | $0.75 | $4.50 | - | No OSWorld published; good for predictable tasks |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 72.5% | No pricing cliff; +$0.08/session-hr (managed agents) |
| Claude Opus 4.6 | $5.00 | $25.00 | 72.7% | +$0.08/session-hr (managed agents) |
| Claude Haiku 4.5 | $1.00 | $5.00 | - | Cheapest; best for structured, predictable automation |
| Gemini 2.5 CU Preview | $1.25 | $10.00 | 69.2% | Preview only as of April 2026; doubles over 200K |
Sources: OpenAI pricing · Anthropic pricing · Google AI pricing
Claude doesn't have a cliff like GPT-5.4's 272K threshold. Sonnet 4.6 bills at $3.00/M input flat -- slightly pricier than GPT-5.4 for short sessions, but no retroactive doubling risk on long ones. That difference matters if your workflows vary in length or are hard to predict at design time.
Gemini 2.5 Computer Use Preview looks good on price, but "preview" carries real operational risk -- rate limits, API changes, and no production SLA. It's worth benchmarking now. Putting it in a production pipeline today is a different call.
What the OSWorld score actually tells you
GPT-5.4 posts the best published OSWorld-Verified score at 75.0%, against Claude Opus 4.6 at 72.7% and Sonnet 4.6 at 72.5%. OSWorld measures whether an agent successfully completes real UI tasks -- forms, file operations, application navigation -- so it is a more relevant benchmark for computer use than most.
Whether a 2-3 percentage point gap matters depends on what failure costs. At 10,000 tasks per month, GPT-5.4 at 75.0% completes around 250 more tasks than a model at 72.5%. If a failed run means a human picks up the task, that can be worth paying a higher per-token rate for. For a pipeline that runs once a day and retries at low cost, the benchmark gap probably is not the variable that matters most.
OSWorld is also diverse by design -- it covers many application types and OS contexts. If your agent spends most of its time in one specific application, the real-world performance gap could be larger or smaller than the headline number. Task-specific evaluation on your actual workflow is worth doing before committing to a provider based on benchmark scores alone.
Four settings worth checking before you scale
1. Set resolution to 1024x768. OpenAI recommends it explicitly in its computer use guide. At 1920x1080, you pay 4-5x more tokens per screenshot with no accuracy gain on standard UI tasks.
2. Run reasoning at medium, not high. High and xhigh generate 3-5x more reasoning tokens. Start at medium and increase only if task quality actually requires it.
3. Checkpoint context at task milestones. Summarize completed steps and start a fresh session for the next phase. Sessions under 272K input pay $2.50/M instead of $5.00/M.
4. Try gpt-5.4-mini on routine steps. At $0.75/M input and $4.50/M output, it's 70% cheaper on input than the standard model. Form filling, predictable navigation, and structured data entry are the tasks most worth testing first. The sketch below pulls all four settings together.
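As a reference, a minimal config sketch consolidating the four settings; the field names are illustrative placeholders, not a real client option set:

```python
# Defaults worth pinning before scaling up. All field names here are
# illustrative -- map them onto whatever client/config layer you use.
AGENT_DEFAULTS = {
    "display": {"width": 1024, "height": 768},  # recommended, not 1920x1080
    "reasoning_effort": "medium",               # raise only after testing
    "checkpoint_after_tokens": 200_000,         # safety margin under 272K
    "routine_step_model": "gpt-5.4-mini",       # cheap tier, predictable steps
}
```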
Sources
- GPT-5.4 API pricing - OpenAI Developer Documentation
- Computer use guide - OpenAI Developer Documentation
- Claude API pricing - Anthropic Documentation
- Claude computer use tool reference - Anthropic Documentation
- Gemini API pricing - Google AI for Developers