DeepSeek V4-Flash redrew the budget LLM tier. Here is where Haiku 4.5, GPT-5.4 Nano, and Gemini Flash-Lite now sit.
On April 24, DeepSeek shipped a $0.14/$0.28-per-million-token model with a 1M context window and three reasoning modes. That single price point reset the floor for the entire budget tier. Eight days later, we have enough data to say what each of the four cheap-tier flagships is actually for.

What each one costs
| Model | Input $/1M | Output $/1M | Cache read $/1M | Batch (in/out) | Context |
|---|---|---|---|---|---|
| DeepSeek V4-Flash | $0.14 | $0.28 | $0.003 | n/a | 1M / 384K out |
| GPT-5.4 Nano | $0.20 | $1.25 | $0.02 | $0.10/$0.625 | 400K / 128K out |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $0.025 | $0.125/$0.75 | 1M / 65K out |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | $0.50/$2.50 | 200K / 64K out |
Standard published rates as of May 1, 2026. DeepSeek lowered all cache-read prices to one-tenth of launch rates on April 26. Gemini Flash-Lite Priority tier is $0.45/$2.70.
The shape of the gap
Sort the four by output price and the spread is bigger than most people assume: $0.28, $1.25, $1.50, $5.00. Haiku 4.5 charges almost eighteen times what V4-Flash charges per million output tokens. Even the next-cheapest closed model, GPT-5.4 Nano, comes in at roughly four and a half times the V4-Flash output rate. So the cheap-tier conversation in May 2026 is really two conversations: V4-Flash vs everything else, and Haiku 4.5 vs the other two closed models.
On a workload with roughly a 10:1 input-to-output split, here is what the month looks like at three volumes:
| Workload | V4-Flash | Nano | Flash-Lite | Haiku 4.5 |
|---|---|---|---|---|
| Daily summarizer (1M in / 100K out per day, 30-day month) | $5.04 | $9.75 | $12.00 | $45.00 |
| Mid-volume RAG (10M in / 1M out per day) | $50.40 | $97.50 | $120.00 | $450.00 |
| High-volume extraction (100M in / 10M out per day) | $504.00 | $975.00 | $1,200.00 | $4,500.00 |
Standard pricing, no caching, no batch. Rounded to the nearest cent. Multiply your actual workload by the per-million numbers in the table above for an exact figure.
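If you want that multiplication as code, here is a minimal sketch that reproduces the table rows from the published per-million rates. The dictionary and function names are mine, not any provider's SDK:

```python
RATES = {  # (input $/1M tok, output $/1M tok), standard tier, May 2026
    "v4-flash":   (0.14, 0.28),
    "nano":       (0.20, 1.25),
    "flash-lite": (0.25, 1.50),
    "haiku-4.5":  (1.00, 5.00),
}

def monthly_cost(model: str, in_tok_per_day: float, out_tok_per_day: float,
                 days: int = 30) -> float:
    """Standard pricing: no caching, no batch discount."""
    in_rate, out_rate = RATES[model]
    daily = (in_tok_per_day / 1e6) * in_rate + (out_tok_per_day / 1e6) * out_rate
    return daily * days

# Daily summarizer row: 1M in / 100K out per day
print(monthly_cost("v4-flash", 1e6, 1e5))   # 5.04
print(monthly_cost("haiku-4.5", 1e6, 1e5))  # 45.00
```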
One thing the table hides: cached input. Every provider discounts cache reads by at least an order of magnitude, and the discounts are not equal. V4-Flash drops to $0.003 per million on a cache hit, roughly a fiftieth of the miss price. Haiku 4.5 drops to $0.10, a tenth. For a chatbot that reuses a 50K-token system prompt across thousands of calls, the effective input price on Haiku 4.5 lands much closer to V4-Flash than the headline numbers suggest. Not equal, but closer.
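How much closer depends on your hit rate. A quick way to see it, assuming for illustration that 90% of input tokens land in cache (my number, not a measured one):

```python
def effective_input_rate(miss_usd: float, hit_usd: float, hit_fraction: float) -> float:
    """Blended $/1M input tokens, given the fraction of tokens served from cache."""
    return hit_fraction * hit_usd + (1 - hit_fraction) * miss_usd

print(effective_input_rate(0.14, 0.003, 0.90))  # V4-Flash:  ~$0.017 / 1M
print(effective_input_rate(1.00, 0.10,  0.90))  # Haiku 4.5: ~$0.19  / 1M
```

At that hit rate the absolute input-price gap shrinks from $0.86 per million to about $0.17.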
What you actually get for each price
These four models are not interchangeable. Each one has a shape that the price tag does not capture. The table below shows what each provider actually ships you.
| Capability | V4-Flash | Nano | Flash-Lite | Haiku 4.5 |
|---|---|---|---|---|
| Vision input | No | Yes | Yes | Yes |
| Audio input | No | No | Yes | No |
| Video input | No | No | Yes | No |
| Reasoning mode | Three (off / high / max) | Standard | On by default | Extended thinking |
| Computer use | No | No | No | Yes |
| Open weights | MIT | No | No | No |
| Max output | 384K | 128K | 65K | 64K |
| Speed (tok/sec) | Not published | Not published | ~363 | ~91 |
Speed estimates from Artificial Analysis measurements where the provider does not publish official tok/sec figures.
A few things stand out. Gemini 3.1 Flash-Lite is the only one of the four that handles audio and video input as a first-class capability, which makes the $0.25 input price look very different if your application processes anything other than text. Haiku 4.5 is the only one that supports computer use natively, so anyone building Anthropic-stack agents that drive a browser or desktop has exactly one budget option, and this is it.
V4-Flash is the only one with open weights. For teams running their own inference, that removes the cost question entirely and replaces it with a hosting question. The model card on Hugging Face confirms 284B total parameters with 13B active per token (MoE), so a single 8xH100 node can in principle serve it, though reasonable-throughput production serving needs more.
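A back-of-envelope check on that claim. The FP8 quantization and the overhead figure are my assumptions, not from the model card; weight memory scales with total parameters, not active ones:

```python
# Rough serving-memory estimate for a 284B-total / 13B-active MoE.
# Assumptions (mine): FP8 weights at 1 byte/param, ~15% extra for
# KV cache, activations, and runtime buffers.
total_params = 284e9
weight_gb = total_params * 1 / 1e9      # ~284 GB of weights at FP8
needed_gb = weight_gb * 1.15            # ~327 GB with overhead
node_hbm_gb = 8 * 80                    # 8x H100 80GB = 640 GB
print(needed_gb, node_hbm_gb, needed_gb < node_hbm_gb)  # fits, with headroom
```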
Benchmark scores, with the asterisks intact
Direct head-to-head benchmarks across all four models on the same eval suite do not exist. Each provider published scores against the benchmarks they wanted to feature and against the competitors they chose. So the table below is a stitched-together view, and you should read it as "what each provider claims" rather than "a referee's scoreboard."
| Benchmark | V4-Flash (Think Max) | Flash-Lite | Haiku 4.5 | Nano |
|---|---|---|---|---|
| MMLU-Pro | 86.2% | 83.0% | n/d | n/d |
| GPQA Diamond | n/d | 72.2% | n/d | n/d |
| LiveCodeBench | 91.6% | 69.9% | n/d | n/d |
| SWE-bench Verified | 79.0% | n/d | 73.3% | n/d |
| SWE-bench Pro | n/d | n/d | n/d | 52.4% |
| AIME 2025 | n/d | 16.7% | n/d | n/d |
| Codeforces | 3052 | n/d | n/d | n/d |
| AA Intelligence Index | n/d | n/d | 31 | ~27 |
n/d = not disclosed by the provider. AA Index is from Artificial Analysis (April 2026). DeepSeek scores are from the V4 model card and require Think Max mode.
Two readings to take from this. First, V4-Flash with Think Max enabled is the only budget-tier model with a public SWE-bench Verified score in the high 70s, which actually puts it ahead of Haiku 4.5 (73.3%) on coding. The math on cost-per-fix is brutal there. Second, Gemini 3.1 Flash-Lite is the most-benchmarked of the four, and the Google-disclosed numbers paint it as the best general-purpose budget reasoner even if it loses to V4-Flash on coding.
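To put a number on "brutal", here is the shape of the cost-per-fix math. The per-attempt token counts are illustrative assumptions; neither provider publishes them, and Think Max reasoning tokens would push V4-Flash's output count higher than a non-thinking run:

```python
def cost_per_fix(in_rate: float, out_rate: float, solve_rate: float,
                 in_tok: int = 60_000, out_tok: int = 15_000) -> float:
    """Expected $ per solved SWE-bench task: one attempt's cost / solve probability.
    Token counts per attempt are illustrative assumptions, not published figures."""
    attempt = (in_tok / 1e6) * in_rate + (out_tok / 1e6) * out_rate
    return attempt / solve_rate

print(cost_per_fix(0.14, 0.28, 0.790))  # V4-Flash (Think Max): ~$0.016 per fix
print(cost_per_fix(1.00, 5.00, 0.733))  # Haiku 4.5:            ~$0.18 per fix
```

Even if you triple V4-Flash's output tokens to account for reasoning overhead, it stays roughly 7x cheaper per solved task under these assumptions.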
What the public scores can't tell you: how each model behaves on your prompts. The variance between "eval-set quality" and "production quality" at this tier is substantially larger than at the frontier. Run your eval before you commit.
Where each one fits, in concrete terms
For batch coding, document processing, and RAG over very large corpora, DeepSeek V4-Flash is the obvious pick. It matches Flash-Lite's 1M context window, and its 384K max output is the largest output budget of the four. Three reasoning modes give you a real cost-quality dial. The catch is operational: no AWS Bedrock, no Azure, no Vertex; it's the direct DeepSeek API or self-hosting. For US-based enterprises with procurement requirements around model hosting, this is a hard wall.
If you are already on OpenAI, GPT-5.4 Nano tends to win by default. The pricing is sane, the 400K context is adequate for most non-RAG use cases, and the cached-input price drops to $0.02/M. Where it loses: OpenAI hasn't published a meaningful benchmark sheet for Nano. The SWE-bench Pro score of 52.4% is the only public number worth quoting, and it lands below Haiku 4.5 on the same benchmark family. Reach for it on classification, extraction, and cheap routing layers in a multi-model pipeline.
Multimodal apps are where Gemini 3.1 Flash-Lite earns its place. Audio, video, and image input at $0.25/$1.50 with a 1M context window is unmatched in this tier. Output streams at roughly 363 tokens per second, which is the difference between a snappy chat UI and a sluggish one. The MMLU-Pro 83% and BFCL v3 76.5% scores show it holds up on function-calling agents.
Claude Haiku 4.5 is the premium of the four. On input you pay seven times the V4-Flash rate and five times the Nano rate; on output, roughly eighteen and four times respectively. What you get for that is computer use, extended thinking, vision, and the Anthropic safety stack. Anyone building agents on Anthropic's tooling has no real cheap-tier alternative; Haiku 4.5 *is* the cheap tier on that stack. The 50% batch discount and the 10x cache-read discount soften the gap on heavy production workloads, but they don't close it.
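Roughly what that softening looks like on the mid-volume RAG row from earlier, assuming an 80% cache-hit fraction on input and that the cache-read rate applies inside batch jobs (both assumptions are mine, not from Anthropic's pricing page):

```python
# Haiku 4.5, mid-volume RAG (10M in / 1M out per day, 30 days), batch tier.
# Assumed: 80% of input tokens hit cache at the $0.10 cache-read rate;
# misses pay the $0.50 batch input rate; output pays $2.50 batch.
in_day, out_day, days = 10e6, 1e6, 30
blended_in = 0.8 * 0.10 + 0.2 * 0.50                      # $0.18 / 1M input tok
daily = blended_in * (in_day / 1e6) + 2.50 * (out_day / 1e6)
print(daily * days)  # ~$129/month, vs $450 standard and ~$50 on V4-Flash
```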
What changed because of V4-Flash
Two months ago, the cheap-tier price floor was around $0.20/$1.25, set by GPT-5.4 Nano and bracketed by Gemini 3 Flash. V4-Flash dropped that floor to $0.14/$0.28, and the output price is what really hurts the closed-model competition. $0.28 vs $1.25 means a workload that emits 100B output tokens a year costs $28K on V4-Flash and $125K on Nano. That gap is large enough that some teams will now run a router: V4-Flash for the bulk, a closed model for the calls that need vision, computer use, or specific provider lock-in.
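A sketch of what such a router can look like: route on required capabilities, default to the cheap open model. The model ID strings here are placeholders, not real API identifiers:

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    needs_vision: bool = False
    needs_computer_use: bool = False
    needs_audio_or_video: bool = False

def route(req: Request) -> str:
    """Capability-based routing: bulk text goes to the cheapest model,
    calls that need a closed-model capability go to the model that has it."""
    if req.needs_computer_use:
        return "claude-haiku-4.5"        # only budget model with computer use
    if req.needs_audio_or_video:
        return "gemini-3.1-flash-lite"   # only budget model with audio/video input
    if req.needs_vision:
        return "gpt-5.4-nano"            # cheapest closed model with vision
    return "deepseek-v4-flash"           # bulk text and code

print(route(Request("summarize this diff")))                        # deepseek-v4-flash
print(route(Request("click the button", needs_computer_use=True)))  # claude-haiku-4.5
```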
The closed-model providers have not responded with a price cut, and probably won't. What they will do, and what they are already doing, is bundle capabilities V4-Flash doesn't have: native multimodal, computer use, batch discounts, deep ecosystem integrations. That bundling is real product value, not just lock-in. Whether it justifies paying anywhere from four to eighteen times more on output depends entirely on whether you actually use those capabilities.
For most code-and-text workloads in May 2026, V4-Flash is the rational default and the burden of proof has shifted to anyone who picks something else. That is not a permanent state. But it's where the math sits today.
Sources
- DeepSeek V4-Flash release notes: api-docs.deepseek.com
- DeepSeek V4-Flash model card on Hugging Face: huggingface.co/deepseek-ai/DeepSeek-V4-Flash
- Claude Haiku 4.5 announcement: anthropic.com/news/claude-haiku-4-5
- Gemini 3.1 Flash-Lite launch post: blog.google
- Gemini API pricing: ai.google.dev
- GPT-5.4 Nano pricing on OpenRouter: openrouter.ai
- Artificial Analysis Intelligence Index: artificialanalysis.ai
- Gemini 3.1 Flash-Lite model card: deepmind.google