TokenCost
Guide · March 20, 2026 · 10 min read

How to cut your LLM API bill by 60% without changing models

Five tactics that work with any provider. No model swaps, no quality tradeoffs. Just the same API calls, for less money. We tested each one and did the math.


Photo by Markus Spiske on Unsplash

TL;DR

  • Prompt caching saves 50-90% on repeated input tokens. All three major providers offer it. OpenAI's is automatic; Anthropic's gives 90% off with explicit breakpoints.
  • Batch API gives 50% off if you can wait up to 24 hours. Stacks with caching for up to 95% off on Anthropic.
  • Output tokens cost 4-6x more than input. Controlling output length with max_tokens and structured outputs is the single highest-leverage fix most people skip.
  • Model routing sends simple queries to cheap models. GPT-4.1 nano is 300x cheaper than Claude Opus 4. RouteLLM (ICLR 2025) showed 85% cost reduction at 95% quality.
  • Combined: stacking these tactics can cut a $15,000/month bill to around $3,600. That's 76% off without touching your model choice.

Most advice about reducing LLM costs starts with "switch to a cheaper model." That's fine if you can afford the quality hit. But what if you actually need GPT-4o or Claude Sonnet for your use case?

Turns out, there's a lot of money sitting on the table before you touch the model dropdown. Caching, batching, output control, routing - these are things every provider supports, and most developers never set up. We went through each one, checked the actual savings, and put together this guide.

Everything below works with the APIs you already use. No new infrastructure, no model fine-tuning, no quality loss.

1. Prompt caching: stop paying for the same tokens twice

If your API calls share a common system prompt, few-shot examples, or repeated context, you're paying full price for identical tokens on every request. Prompt caching fixes that.

All three major providers now offer some form of it, but the discounts and mechanics are different enough to matter.

| Provider | Cache discount | Write cost | Min tokens | Automatic? |
|---|---|---|---|---|
| OpenAI (GPT-4o) | 50% off | Free | 1,024 | Yes |
| OpenAI (GPT-4.1) | 75% off | Free | 1,024 | Yes |
| OpenAI (GPT-5) | 90% off | Free | 1,024 | Yes |
| Anthropic | 90% off | 1.25x | 1,024-4,096 | Auto + explicit |
| Google Gemini | 90% off | ~$1/MTok/hr | 1,024-4,096 | Implicit + explicit |

OpenAI's version is the easiest to use because it's automatic. No code changes. If your prompt is 1,024+ tokens and shares a prefix with recent requests, OpenAI caches it for you. The discount depends on the model family - 50% for GPT-4o, 75% for GPT-4.1, 90% for GPT-5.

Anthropic's caching gives you more control. You mark content blocks with cache breakpoints and get 90% off on reads. The catch: cache writes cost 1.25x the normal input price. So you lose money if the cache is never read. But it breaks even after just one reuse.
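A minimal sketch of what an explicit breakpoint looks like, assuming the Messages API's `cache_control` field; the model name and `LONG_SYSTEM_PROMPT` here are placeholders, not a real deployment:

```python
# Sketch of an Anthropic Messages API request body with an explicit cache
# breakpoint on the system prompt. The 1.25x write happens on the first
# call; later calls sharing this prefix read it at 90% off.
LONG_SYSTEM_PROMPT = "You are a support agent. " * 200  # placeholder; must clear the ~1,024-token minimum

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4",  # illustrative model name
        "max_tokens": 512,
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                # Marks everything up to this block as cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("Where is my order?")
```

Only the static prefix goes before the breakpoint; the per-user message stays outside it, so every request shares the cached portion.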

One developer reported going from $720/month to $72/month after enabling Anthropic's prompt caching. That's a 90% cut, just from caching a system prompt that was being sent with every request.

2. Batch API: half price if you can wait

OpenAI, Anthropic, and Google all offer the same deal: upload a batch of requests, accept a 24-hour processing window, and pay 50% less on every token. In practice, small batches often come back in 1-2 hours.

This works for anything that doesn't need a real-time response. Summarization, data extraction, content generation, model evaluations, embeddings - if you're processing a queue, batch it.
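The Batch API takes a JSONL file, one request object per line. Here is a sketch of building one for a summarization queue, assuming OpenAI's batch request shape (`custom_id`, `method`, `url`, `body`); the documents are placeholders:

```python
import json

# Build a JSONL batch file for the OpenAI Batch API: one request per line,
# each tagged with a custom_id so results can be matched back afterwards.
docs = {"doc-1": "First article text...", "doc-2": "Second article text..."}

lines = []
for doc_id, text in docs.items():
    lines.append(json.dumps({
        "custom_id": doc_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "max_tokens": 150,
            "messages": [
                {"role": "system", "content": "Summarize in 2-3 sentences."},
                {"role": "user", "content": text},
            ],
        },
    }))

batch_jsonl = "\n".join(lines)  # upload this file, then poll for results
```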

The really interesting part is stacking. On Anthropic, the 50% batch discount multiplies with the 90% caching discount. So cached batch reads cost 5% of the base price. That turns Claude Sonnet's $3.00/MTok input into $0.15/MTok. That's 95% off.

The stacking math

Claude Sonnet 4 base input: $3.00/MTok

With caching (90% off): $0.30/MTok

With caching + batch (50% off on top): $0.15/MTok

That's 95% off the standard rate. The deepest discount available from any provider.
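The stacking arithmetic as a tiny helper:

```python
# Effective per-MTok input price after stacking Anthropic's discounts:
# cached reads cost 10% of the base rate, and batch requests take a
# further 50% off whatever remains.
def effective_rate(base_per_mtok: float, cache_read: bool, batched: bool) -> float:
    rate = base_per_mtok
    if cache_read:
        rate *= 0.10  # 90% prompt-caching discount on reads
    if batched:
        rate *= 0.50  # 50% Batch API discount
    return rate

sonnet_in = 3.00  # Claude Sonnet 4 base input price, $/MTok
cached = effective_rate(sonnet_in, cache_read=True, batched=False)        # ~$0.30/MTok
cached_batched = effective_rate(sonnet_in, cache_read=True, batched=True) # ~$0.15/MTok
```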

3. Output tokens are the real cost driver

This is the one most people miss. Output tokens cost 4-6x more than input tokens across every provider. On GPT-5.4, it's $2.50 in vs. $15.00 out. On Claude Opus 4, it's $15.00 in vs. $75.00 out.

If your model is generating a 500-word essay when you needed a one-line answer, you're paying 5x too much on the most expensive part of every call.

| Model | Input / 1M | Output / 1M | Output is... |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 4x more |
| GPT-5.4 | $2.50 | $15.00 | 6x more |
| Claude Sonnet 4 | $3.00 | $15.00 | 5x more |
| Claude Opus 4 | $15.00 | $75.00 | 5x more |

Three fixes that take five minutes each:

1. Set max_tokens. For classification tasks, set it to 10-50 tokens. For summaries, set it proportional to the desired length. For data extraction, use structured outputs with a schema. You're throwing money away on every token past what you actually read.
2. Tell the model to be concise. "Respond in 2-3 sentences" or "Return only the JSON, no explanation." Models default to being helpful and verbose. You have to ask for brevity.
3. Use structured outputs. JSON mode or OpenAI's Structured Outputs feature forces responses into a schema. No prose wrapping, no "Here is your answer:" preamble. Just the data you need. This also eliminates retry calls from parsing failures.
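Here is a sketch of the first and third fix together: chat-completion parameters with a tight `max_tokens` and a Structured Outputs schema. The invoice schema is illustrative, not a prescribed format:

```python
# Request parameters that cap output spend: a hard max_tokens ceiling plus
# a Structured Outputs schema so the model returns only the fields we read.
extraction_request = {
    "model": "gpt-4o",
    "max_tokens": 100,  # ceiling on the expensive side of the call
    "messages": [
        {"role": "system", "content": "Extract the invoice fields. Return only JSON."},
        {"role": "user", "content": "Invoice #1042, due 2026-04-01, total $310.50"},
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "invoice",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "invoice_number": {"type": "string"},
                    "due_date": {"type": "string"},
                    "total": {"type": "number"},
                },
                "required": ["invoice_number", "due_date", "total"],
                "additionalProperties": False,
            },
        },
    },
}
```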

Doing all three typically cuts output tokens by 30-70%. On a GPT-5.4 workload, where output tokens dominate the bill, that's a big number. Use our cost calculator to model how much you'd save for your specific token mix.

4. Model routing: use the expensive model only when you need it

Here's a number that surprised us: GPT-4.1 nano costs $0.05 per million input tokens. Claude Opus 4 costs $15.00. That's a 300x price gap. Even within OpenAI's own lineup, GPT-4o mini ($0.15) is 16.7x cheaper than GPT-4o ($2.50).

The question is: do all your queries actually need the expensive model? In most production systems, the answer is no. Simple FAQ lookups, classification tasks, format conversions - these don't need frontier intelligence. They need a model that's good enough. And "good enough" is 10-300x cheaper.

RouteLLM, a framework published at ICLR 2025 by the LMSYS team, demonstrated this at scale. Their router achieved 95% of GPT-4's quality while sending only 26% of requests to GPT-4. The rest went to cheaper models. Cost reduction: roughly 85%.

You don't need anything that fancy to get started. A simple classifier based on query length and keyword matching can route 60-80% of typical traffic to the cheap tier. Even that gets you 30-70% savings.
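A minimal sketch of such a classifier; the keyword list, length cutoff, and model names are placeholder heuristics you'd tune against your own traffic:

```python
# A deliberately simple router: short queries with no "reasoning" keywords
# go to the cheap tier; everything else goes to the frontier model.
REASONING_HINTS = ("why", "explain", "analyze", "compare", "debug", "prove")

def route(query: str, cheap: str = "gpt-4o-mini", expensive: str = "gpt-4o") -> str:
    words = query.lower().split()
    needs_reasoning = len(words) > 40 or any(h in words for h in REASONING_HINTS)
    return expensive if needs_reasoning else cheap

cheap_pick = route("What are your opening hours?")            # "gpt-4o-mini"
pricey_pick = route("Explain why this query plan is slow")    # "gpt-4o"
```

Crude as it is, misrouting a hard query to the cheap tier is the main failure mode, so bias the heuristics toward the expensive model when in doubt.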

Compare model prices on our full pricing table to find the right cheap model for your routing setup, or check the leaderboard to find models that score well for less money.

5. Semantic caching: skip the API call entirely

Prompt caching reduces the cost of a call. Semantic caching eliminates the call altogether. The idea: use embeddings to match incoming queries against previous ones. If a new question is semantically similar enough to one you've already answered, serve the cached response.

This works best for applications with repetitive queries - customer support, FAQ bots, educational platforms. A developer on Dev.to reported 72% cost reduction with an 87.5% cache hit rate. Another implementation documented on Medium showed 37.7% savings in the first month with a 62% hit rate.

GPTCache (open source, by Zilliz) is the most popular tool for this. It uses an embedding model plus a vector store to find similar queries, and it integrates with LangChain and LlamaIndex.

The main risk: similar questions sometimes have different correct answers. "Weather in NYC" asked on two different days should not return the same answer. You need to tune the similarity threshold carefully and add invalidation rules for time-sensitive data.
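A toy sketch of the control flow; a real deployment would swap `embed()` for an actual embedding model and the list scan for a vector store such as the one GPTCache manages:

```python
import math

# Minimal semantic cache. embed() is a toy bag-of-words stand-in so the
# logic is runnable without an embedding model or vector store.
def embed(text: str) -> dict:
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # tune carefully: too low returns wrong answers
        self.entries = []           # list of (embedding, answer) pairs

    def get(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]  # cache hit: no API call made
        return None         # miss: caller pays for a real completion

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("what are your business hours", "We're open 9-5, Monday to Friday.")
hit = cache.get("what are your business hours today")  # near-duplicate: served from cache
```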

Putting it together: $15,000 to $3,600

Here's what these tactics look like combined on a real workload. Take a team running 1M requests/month on GPT-4o, averaging 2,000 input and 1,000 output tokens per request.

| Tactic | Monthly savings | How |
|---|---|---|
| Baseline | $15,000 | $5K input + $10K output |
| Model routing (70% to mini) | -$10,500 | 70% of calls at 16.7x less |
| Prompt caching (50% hit rate) | -$675 | 50% off cached inputs on remaining GPT-4o |
| Output optimization (30% fewer) | -$900 | Structured outputs + max_tokens |
| Batch processing (20% of requests) | -$300 | Non-urgent work at 50% off |
| After optimization | ~$3,600 | 76% reduction |
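The baseline row checks out with a few lines of arithmetic, using GPT-4o's list prices from the table above:

```python
# Baseline monthly cost for the example workload: 1M requests on GPT-4o
# at $2.50/MTok input and $10.00/MTok output, 2,000 in / 1,000 out per call.
def monthly_cost(requests, in_tokens, out_tokens, in_price, out_price):
    input_cost = requests * in_tokens / 1e6 * in_price    # MTok consumed x $/MTok
    output_cost = requests * out_tokens / 1e6 * out_price
    return input_cost, output_cost

inp, out = monthly_cost(1_000_000, 2_000, 1_000, 2.50, 10.00)
# inp is $5,000 and out is $10,000: output is two-thirds of the bill
```

Note that output is two-thirds of the baseline even though each request sends twice as many input tokens as it receives, which is why tactic 3 punches above its weight.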

The biggest single lever is routing. If most of your queries can go to a cheaper model, that alone gets you more than half the savings. Caching and output optimization are smaller individually but compound nicely. And batching is free money for any non-realtime workload.

Where to start

If you only do one thing, check your output tokens. Look at your average response length and ask whether you actually need all of it. Setting max_tokens and asking for concise responses is a five-minute change that can cut your bill by 20-30%.

If you're on OpenAI, prompt caching is already happening automatically. Check your API dashboard to see how many cached tokens you're getting. If the number is low, restructure your prompts to put static content first and variable content last.
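A sketch of that layout; `STATIC_INSTRUCTIONS` and the few-shot examples are placeholders standing in for your real static prefix:

```python
# Prompt layout that maximizes OpenAI's automatic prefix caching:
# everything static (instructions, few-shot examples) goes first, the
# per-request content goes last, so consecutive requests share a prefix.
STATIC_INSTRUCTIONS = "You are a billing assistant. " * 100  # placeholder; prefix must exceed 1,024 tokens
FEW_SHOT = [
    {"role": "user", "content": "Example question"},
    {"role": "assistant", "content": "Example answer"},
]

def build_messages(user_query: str) -> list:
    # Static prefix first, variable suffix last.
    return [{"role": "system", "content": STATIC_INSTRUCTIONS}, *FEW_SHOT,
            {"role": "user", "content": user_query}]

a = build_messages("Why was I charged twice?")
b = build_messages("Cancel my subscription")
# a and b differ only in the final message, so the shared prefix can cache
```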

For routing, start simple. Split your queries into "needs reasoning" and "needs a quick answer." Send the quick answers to GPT-4o mini or Haiku. You can always get more sophisticated later.

The bottom line

You don't need to downgrade your model to spend less. Caching, batching, output control, and routing work with whatever model you're already using. The savings stack, and most of them take less than an afternoon to set up.

Want to see how much each model costs right now? Check our live pricing table (updated every six hours) or plug your numbers into the cost calculator.
