DeepSeek V4-Flash redrew the budget LLM tier. Here is where Haiku 4.5, GPT-5.4 Nano, and Gemini Flash-Lite now sit.
On April 24, DeepSeek shipped a $0.14/$0.28-per-million-token model with a 1M context window and three reasoning modes. That single price point reset the floor for the entire budget tier. Eight days later, we have enough data to say what each of the four cheap-tier flagships is actually for.

What each one costs
| Model | Input $/1M | Output $/1M | Cache read $/1M | Batch (in/out) | Context |
|---|---|---|---|---|---|
| DeepSeek V4-Flash | $0.14 | $0.28 | $0.003 | n/a | 1M / 384K out |
| GPT-5.4 Nano | $0.20 | $1.25 | $0.02 | $0.10/$0.625 | 400K / 128K out |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $0.025 | $0.125/$0.75 | 1M / 65K out |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | $0.50/$2.50 | 200K / 64K out |
Standard published rates as of May 1, 2026. DeepSeek lowered all cache-read prices to one-tenth of launch rates on April 26. Gemini Flash-Lite Priority tier is $0.45/$2.70.
The shape of the gap
Sort the four by output price and the spread is bigger than most people assume: $0.28, $1.25, $1.50, $5.00. Haiku 4.5 charges almost eighteen times what V4-Flash charges per million output tokens. Even the next-cheapest closed model, GPT-5.4 Nano, comes in at roughly four and a half times the V4-Flash output rate. So the cheap-tier conversation in May 2026 is really two conversations: V4-Flash vs everything else, and Haiku 4.5 vs the other two closed models.
On a workload with roughly a 10:1 input-to-output split, here is what the month looks like at three volumes:
| Workload | V4-Flash | Nano | Flash-Lite | Haiku 4.5 |
|---|---|---|---|---|
| Daily summarizer (1M in / 100K out per day, 30-day month) | $5.04 | $9.75 | $12.00 | $45.00 |
| Mid-volume RAG (10M in / 1M out per day) | $50.40 | $97.50 | $120.00 | $450.00 |
| High-volume extraction (100M in / 10M out per day) | $504.00 | $975.00 | $1,200.00 | $4,500.00 |
Standard pricing, no caching, no batch. Rounded to the nearest cent. Multiply your actual workload by the per-million numbers in the table above for an exact figure.
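If you want that multiplication as code, here is a minimal sketch that reproduces the table rows from the published per-million rates. The dictionary and function names are mine, not any provider's SDK:

```python
RATES = {  # (input $/1M tok, output $/1M tok), standard tier, May 2026
    "v4-flash":   (0.14, 0.28),
    "nano":       (0.20, 1.25),
    "flash-lite": (0.25, 1.50),
    "haiku-4.5":  (1.00, 5.00),
}

def monthly_cost(model: str, in_tok_per_day: float, out_tok_per_day: float,
                 days: int = 30) -> float:
    """Standard pricing: no caching, no batch discount."""
    in_rate, out_rate = RATES[model]
    daily = (in_tok_per_day / 1e6) * in_rate + (out_tok_per_day / 1e6) * out_rate
    return daily * days

# Daily summarizer row: 1M in / 100K out per day
print(monthly_cost("v4-flash", 1e6, 1e5))   # 5.04
print(monthly_cost("haiku-4.5", 1e6, 1e5))  # 45.00
```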
One thing the table hides: cached input. Every provider discounts cache reads by at least an order of magnitude, and the discounts are not equal. V4-Flash drops to $0.003 per million on a cache hit, roughly a fiftieth of the miss price. Haiku 4.5 drops to $0.10, a tenth. For a chatbot that reuses a 50K-token system prompt across thousands of calls, the effective input price on Haiku 4.5 lands much closer to V4-Flash than the headline numbers suggest. Not equal, but closer.
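How much closer depends on your hit rate. A quick way to see it, assuming for illustration that 90% of input tokens land in cache (my number, not a measured one):

```python
def effective_input_rate(miss_usd: float, hit_usd: float, hit_fraction: float) -> float:
    """Blended $/1M input tokens, given the fraction of tokens served from cache."""
    return hit_fraction * hit_usd + (1 - hit_fraction) * miss_usd

print(effective_input_rate(0.14, 0.003, 0.90))  # V4-Flash:  ~$0.017 / 1M
print(effective_input_rate(1.00, 0.10,  0.90))  # Haiku 4.5: ~$0.19  / 1M
```

At that hit rate the absolute input-price gap shrinks from $0.86 per million to about $0.17.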
What you actually get for each price
These four models are not interchangeable. Each one has a shape that the price tag does not capture. The table below shows what each provider actually ships you.
| Capability | V4-Flash | Nano | Flash-Lite | Haiku 4.5 |
|---|---|---|---|---|
| Vision input | No | Yes | Yes | Yes |
| Audio input | No | No | Yes | No |
| Video input | No | No | Yes | No |
| Reasoning mode | Three (off / high / max) | Standard | On by default | Extended thinking |
| Computer use | No | No | No | Yes |
| Open weights | MIT | No | No | No |
| Max output | 384K | 128K | 65K | 64K |
| Speed (tok/sec) | Not published | Not published | ~363 | ~91 |
Speed estimates from Artificial Analysis measurements where the provider does not publish official tok/sec figures.
A few things stand out. Gemini 3.1 Flash-Lite is the only one of the four that handles audio and video input as a first-class capability, which makes the $0.25 input price look very different if your application processes anything other than text. Haiku 4.5 is the only one that supports computer use natively, so anyone building Anthropic-stack agents that drive a browser or desktop has exactly one budget option, and this is it.
V4-Flash is the only one with open weights. For teams running their own inference, that removes the cost question entirely and replaces it with a hosting question. The model card on Hugging Face confirms 284B total parameters with 13B active per token (MoE), so a single 8xH100 node can in principle serve it, though reasonable-throughput production serving needs more.
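A back-of-envelope check on that claim. The FP8 quantization and the overhead figure are my assumptions, not from the model card; weight memory scales with total parameters, not active ones:

```python
# Rough serving-memory estimate for a 284B-total / 13B-active MoE.
# Assumptions (mine): FP8 weights at 1 byte/param, ~15% extra for
# KV cache, activations, and runtime buffers.
total_params = 284e9
weight_gb = total_params * 1 / 1e9      # ~284 GB of weights at FP8
needed_gb = weight_gb * 1.15            # ~327 GB with overhead
node_hbm_gb = 8 * 80                    # 8x H100 80GB = 640 GB
print(needed_gb, node_hbm_gb, needed_gb < node_hbm_gb)  # fits, with headroom
```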
Benchmark scores, with the asterisks intact
Direct head-to-head benchmarks across all four models on the same eval suite do not exist. Each provider published scores against the benchmarks they wanted to feature and against the competitors they chose. So the table below is a stitched-together view, and you should read it as "what each provider claims" rather than "a referee's scoreboard."
| Benchmark | V4-Flash (Think Max) | Flash-Lite | Haiku 4.5 | Nano |
|---|---|---|---|---|
| MMLU-Pro | 86.2% | 83.0% | n/d | n/d |
| GPQA Diamond | n/d | 72.2% | n/d | n/d |
| LiveCodeBench | 91.6% | 69.9% | n/d | n/d |
| SWE-bench Verified | 79.0% | n/d | 73.3% | n/d |
| SWE-bench Pro | n/d | n/d | n/d | 52.4% |
| AIME 2025 | n/d | 16.7% | n/d | n/d |
| Codeforces | 3052 | n/d | n/d | n/d |
| AA Intelligence Index | n/d | n/d | 31 | ~27 |
n/d = not disclosed by the provider. AA Index is from Artificial Analysis (April 2026). DeepSeek scores are from the V4 model card and require Think Max mode.
Two readings to take from this. First, V4-Flash with Think Max enabled is the only budget-tier model with a public SWE-bench Verified score in the high 70s, which actually puts it ahead of Haiku 4.5 (73.3%) on coding. The math on cost-per-fix is brutal there. Second, Gemini 3.1 Flash-Lite is the most-benchmarked of the four, and the Google-disclosed numbers paint it as the best general-purpose budget reasoner even if it loses to V4-Flash on coding.
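To put a number on "brutal", here is the shape of the cost-per-fix math. The per-attempt token counts are illustrative assumptions; neither provider publishes them, and Think Max reasoning tokens would push V4-Flash's output count higher than a non-thinking run:

```python
def cost_per_fix(in_rate: float, out_rate: float, solve_rate: float,
                 in_tok: int = 60_000, out_tok: int = 15_000) -> float:
    """Expected $ per solved SWE-bench task: one attempt's cost / solve probability.
    Token counts per attempt are illustrative assumptions, not published figures."""
    attempt = (in_tok / 1e6) * in_rate + (out_tok / 1e6) * out_rate
    return attempt / solve_rate

print(cost_per_fix(0.14, 0.28, 0.790))  # V4-Flash (Think Max): ~$0.016 per fix
print(cost_per_fix(1.00, 5.00, 0.733))  # Haiku 4.5:            ~$0.18 per fix
```

Even if you triple V4-Flash's output tokens to account for reasoning overhead, it stays roughly 7x cheaper per solved task under these assumptions.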
What the public scores can't tell you: how each model behaves on your prompts. The variance between "eval-set quality" and "production quality" at this tier is substantially larger than at the frontier. Run your eval before you commit.
Where each one fits, in concrete terms
For batch coding, document processing, and RAG over very large corpora, DeepSeek V4-Flash is the obvious pick. It matches Flash-Lite's 1M context window, and its 384K max output is the largest output budget of the four. Three reasoning modes give you a real cost-quality dial. The catch is operational: no AWS Bedrock, no Azure, no Vertex; it's the direct DeepSeek API or self-hosting. For US-based enterprises with procurement requirements around model hosting, this is a hard wall.
If you are already on OpenAI, GPT-5.4 Nano tends to win by default. The pricing is sane, the 400K context is adequate for most non-RAG use cases, and the cached-input price drops to $0.02/M. Where it loses: OpenAI hasn't published a meaningful benchmark sheet for Nano. The SWE-bench Pro score of 52.4% is the only public number worth quoting, and it lands below Haiku 4.5 on the same benchmark family. Reach for it on classification, extraction, and cheap routing layers in a multi-model pipeline.
Multimodal apps are where Gemini 3.1 Flash-Lite earns its place. Audio, video, and image input at $0.25/$1.50 with a 1M context window is unmatched in this tier. Output streams at roughly 363 tokens per second, which is the difference between a snappy chat UI and a sluggish one. The MMLU-Pro 83% and BFCL v3 76.5% scores show it holds up on function-calling agents.
Claude Haiku 4.5 is the premium of the four. On input you pay seven times the V4-Flash rate and five times the Nano rate; on output, roughly eighteen and four times respectively. What you get for that is computer use, extended thinking, vision, and the Anthropic safety stack. Anyone building agents on Anthropic's tooling has no real cheap-tier alternative; Haiku 4.5 *is* the cheap tier on that stack. The 50% batch discount and the 10x cache-read discount soften the gap on heavy production workloads, but they don't close it.
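Roughly what that softening looks like on the mid-volume RAG row from earlier, assuming an 80% cache-hit fraction on input and that the cache-read rate applies inside batch jobs (both assumptions are mine, not from Anthropic's pricing page):

```python
# Haiku 4.5, mid-volume RAG (10M in / 1M out per day, 30 days), batch tier.
# Assumed: 80% of input tokens hit cache at the $0.10 cache-read rate;
# misses pay the $0.50 batch input rate; output pays $2.50 batch.
in_day, out_day, days = 10e6, 1e6, 30
blended_in = 0.8 * 0.10 + 0.2 * 0.50                      # $0.18 / 1M input tok
daily = blended_in * (in_day / 1e6) + 2.50 * (out_day / 1e6)
print(daily * days)  # ~$129/month, vs $450 standard and ~$50 on V4-Flash
```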
What changed because of V4-Flash
Two months ago, the cheap-tier price floor was around $0.20/$1.25, set by GPT-5.4 Nano and bracketed by Gemini 3 Flash. V4-Flash dropped that floor to $0.14/$0.28, and the output price is what really hurts the closed-model competition. $0.28 vs $1.25 means a workload that emits 100B output tokens a year costs $28K on V4-Flash and $125K on Nano. That gap is large enough that some teams will now run a router: V4-Flash for the bulk, a closed model for the calls that need vision, computer use, or specific provider lock-in.
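A sketch of what such a router can look like: route on required capabilities, default to the cheap open model. The model ID strings here are placeholders, not real API identifiers:

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    needs_vision: bool = False
    needs_computer_use: bool = False
    needs_audio_or_video: bool = False

def route(req: Request) -> str:
    """Capability-based routing: bulk text goes to the cheapest model,
    calls that need a closed-model capability go to the model that has it."""
    if req.needs_computer_use:
        return "claude-haiku-4.5"        # only budget model with computer use
    if req.needs_audio_or_video:
        return "gemini-3.1-flash-lite"   # only budget model with audio/video input
    if req.needs_vision:
        return "gpt-5.4-nano"            # cheapest closed model with vision
    return "deepseek-v4-flash"           # bulk text and code

print(route(Request("summarize this diff")))                        # deepseek-v4-flash
print(route(Request("click the button", needs_computer_use=True)))  # claude-haiku-4.5
```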
The closed-model providers have not responded with a price cut, and probably won't. What they will do, and what they are already doing, is bundle capabilities V4-Flash doesn't have: native multimodal, computer use, batch discounts, deep ecosystem integrations. That bundling is real product value, not just lock-in. Whether it justifies paying anywhere from four to eighteen times more on output depends entirely on whether you actually use those capabilities.
For most code-and-text workloads in May 2026, V4-Flash is the rational default and the burden of proof has shifted to anyone who picks something else. That is not a permanent state. But it's where the math sits today.
Sources
- DeepSeek V4-Flash release notes: api-docs.deepseek.com
- DeepSeek V4-Flash model card on Hugging Face: huggingface.co/deepseek-ai/DeepSeek-V4-Flash
- Claude Haiku 4.5 announcement: anthropic.com/news/claude-haiku-4-5
- Gemini 3.1 Flash-Lite launch post: blog.google
- Gemini API pricing: ai.google.dev
- GPT-5.4 Nano pricing on OpenRouter: openrouter.ai
- Artificial Analysis Intelligence Index: artificialanalysis.ai
- Gemini 3.1 Flash-Lite model card: deepmind.google