Gemini Flex and Priority inference: how Google's new tiers work and what they cost
Google added Flex and Priority inference to the Gemini API on April 2. Flex is 50% off standard pricing, synchronous, and takes 1-15 minutes to respond. Priority costs 75-100% more and guarantees fast responses -- but only if you're Tier 2, and there's a silent downgrade you need to know about.

Until now, the Gemini API had two pricing modes: standard (full price, fast) and batch (50% off, up to 24 hours, async). Google's April 2 announcement added Flex and Priority to fill in the gaps. Flex brings batch-level pricing to synchronous requests -- same endpoint, same call pattern, just slower. Priority lets you pay more to stay out of the throttle queue during load spikes. Neither is obviously right for everyone. Which one matters to you depends on whether your bottleneck is cost or latency variance.
The four tiers side by side
| Tier | Pricing | Latency | Interface | Access |
|---|---|---|---|---|
| Priority | +80% over standard (listed as +75-100%) | Seconds | Synchronous | Tier 2+ only |
| Standard | Full price | Seconds to minutes | Synchronous | Tier 1+ |
| Flex | 50% off standard | 1-15 minutes | Synchronous | Tier 1+ |
| Batch | 50% off standard | Up to 24 hours | Asynchronous | Tier 1+ |
Batch matches Flex on price (50% off standard) but trades the synchronous call for an asynchronous interface: you submit a job, get an ID, and poll until it completes, which can take up to 24 hours.
Flex: batch pricing without the async headache
Flex costs the same as Batch -- 50% off standard -- but it's synchronous. You call the same generateContent endpoint and add one parameter: service_tier: "flex". You wait for the response in the same call. No JSONL files, no polling for job status, no managing batch objects.
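A minimal sketch of what that looks like over REST, assuming `service_tier` sits at the top level of the request body (the exact field placement may differ in your SDK or API version):

```python
import json
import urllib.request

URL = ("https://generativelanguage.googleapis.com/v1beta/"
       "models/gemini-2.5-flash:generateContent")

def build_flex_body(prompt):
    """Standard generateContent body plus the one Flex parameter."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "service_tier": "flex",  # the only change vs a standard call
    }

def flex_generate(api_key, prompt):
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_flex_body(prompt)).encode(),
        headers={"x-goog-api-key": api_key, "Content-Type": "application/json"},
    )
    # One synchronous call; it may block for up to 15 minutes in the
    # Flex queue, so the timeout must comfortably exceed that window.
    return urllib.request.urlopen(req, timeout=16 * 60)
```

The call shape is otherwise identical to a standard request, which is the whole point: no job objects to manage.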
The catch is latency. Flex requests can sit in a queue for up to 15 minutes before Google processes them. For agentic workflows where each step depends on the previous output, 15 minutes per step adds up fast. But for background tasks that don't need to complete immediately -- data extraction pipelines, report generation, embedding jobs -- Flex is often a better fit than Batch because you don't have to build the async infrastructure around it.
One thing the docs are clear about: if Flex capacity is full, you get a 503 or 429. Requests don't silently upgrade to Standard and charge you the higher rate. That's actually the right behavior -- you just need to implement retry logic with exponential backoff on your end.
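A retry loop along those lines might look like the following sketch; `send_request` stands in for whatever callable issues the Flex request and returns a response with a `status_code`:

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: up to 1s, 2s, 4s... capped at 60s."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_flex_with_retry(send_request, max_attempts=5, base=1.0):
    """Retry on 429/503 (Flex capacity full); return immediately otherwise."""
    for attempt in range(max_attempts):
        resp = send_request()
        if resp.status_code not in (429, 503):
            return resp
        time.sleep(backoff_delay(attempt, base=base))
    return resp  # still throttled after max_attempts; caller decides what's next
```

Jitter matters here: if every client backs off on the same schedule, they all retry at once and hit the full queue together again.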
Priority: paying for reliability, with one catch
Google lists Priority at 75-100% over standard, but the published per-model prices all work out to exactly 1.8x. Gemini 2.5 Flash at standard is $0.30 input / $2.50 output per million tokens; at Priority it's $0.54 / $4.50. In exchange, your requests skip the queue and get processed in seconds, even during high load.
The thing most developers will miss: Priority silently downgrades to Standard when you exceed the Priority rate limits. You get a Standard-speed response billed at Standard rates -- and there's no error, no flag in the response body. The only way to confirm you're actually getting Priority is checking the x-gemini-service-tier response header. If it says standard when you expected priority, you've been downgraded.
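One way to guard against that, assuming the `x-gemini-service-tier` header behaves as described (the helper names here are mine):

```python
def served_tier(headers):
    """Return the tier a response was actually served at, case-insensitively."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("x-gemini-service-tier", "unknown")

def was_downgraded(headers, requested="priority"):
    """True if you asked for Priority but the response came back Standard."""
    return requested == "priority" and served_tier(headers) != "priority"
```

In production you'd log or alert on downgrades rather than fail the request -- the response itself is still valid, you just didn't get the latency guarantee you asked for.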
Priority rate limits are also lower than Standard -- about 0.3x the normal RPM per model. This isn't a bug in the docs; it's by design. Priority is for burst use cases where you occasionally need guaranteed latency, not for sustained high-throughput production traffic. If you need both, you're in Tier 3 territory.
Access requirement: Tier 2 accounts only. That means at least $100 in real paid charges (not free trial credits, not promotional credits) and a 3-day wait from your first successful payment. There's no workaround for new accounts.
Full pricing by model and tier
All prices per 1 million tokens. Batch pricing is identical to Flex. Models marked "Preview" have more restrictive rate limits regardless of tier.
| Model | Standard | Flex (50% off) | Priority (+80%) |
|---|---|---|---|
| Gemini 3.1 Pro Preview (input above 200K context: 2x) | $2.00 in $12.00 out | $1.00 in $6.00 out | $3.60 in $21.60 out |
| Gemini 2.5 Pro (input above 200K context: 2x) | $1.25 in $10.00 out | $0.63 in $5.00 out | $2.25 in $18.00 out |
| Gemini 3 Flash Preview | $0.50 in $3.00 out | $0.25 in $1.50 out | $0.90 in $5.40 out |
| Gemini 2.5 Flash | $0.30 in $2.50 out | $0.15 in $1.25 out | $0.54 in $4.50 out |
| Gemini 3.1 Flash-Lite Preview | $0.25 in $1.50 out | $0.125 in $0.75 out | $0.45 in $2.70 out |
| Gemini 2.5 Flash-Lite | $0.10 in $0.40 out | $0.05 in $0.20 out | $0.18 in $0.72 out |
Context caching is available on all tiers. Cached input costs more on Priority (example: Gemini 3.1 Pro cache is $0.36/M on Priority vs $0.20/M on Standard). Audio input is priced separately and also scales with the tier multiplier.
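The multipliers behind the table are mechanical, so a small helper can sanity-check a bill (prices are the per-million-token figures above; the functions are illustrative):

```python
def tier_price(standard_price, tier):
    """Apply the tier multiplier: Flex is 50% off, Priority is 1.8x."""
    multiplier = {"standard": 1.0, "flex": 0.5, "priority": 1.8}[tier]
    return round(standard_price * multiplier, 3)

def request_cost(in_tokens, out_tokens, in_price, out_price, tier):
    """Dollar cost of one request at a given tier (prices per 1M tokens)."""
    return (in_tokens * tier_price(in_price, tier)
            + out_tokens * tier_price(out_price, tier)) / 1_000_000
```

For example, `tier_price(0.30, "priority")` reproduces the $0.54 Gemini 2.5 Flash input price from the table.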
Which tier actually makes sense for your use case
Flex is the right call if you have agentic workflows where steps are sequential but don't need to complete in real time -- think overnight report generation, classification pipelines you kick off and check later, or batch data processing where the results feed into the next day's job. Flex beats Batch specifically when your pipeline is sequential rather than purely parallel. Batch requires you to manage JSONL files and poll for completion; Flex keeps everything in one synchronous call that you just... wait for.
Priority makes sense for user-facing features where a multi-second delay causes a real problem -- chat interfaces, interactive document analysis, anything where the user is watching a spinner. It also helps in production environments with spiky traffic where standard can get throttled during peaks. Whether the 80% premium is worth it depends on how much you care about latency variance vs. average latency. If you just need fast on average, standard is usually fine. If you need fast even at the 95th percentile, Priority is what you're paying for.
Standard remains the default for most use cases. If you're not optimizing for either budget or guaranteed latency, there's no reason to change tiers.
Five things to know before switching tiers
1. Flex client timeout
The Gemini SDK's default timeout is far shorter than 15 minutes. If you don't override it, your client disconnects from the Flex queue and gets no response. Set your client timeout to 900,000ms (15 minutes) at minimum.
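With the google-genai Python SDK, that would look something like the following sketch -- assuming its `HttpOptions.timeout` field, which takes milliseconds (verify the field name and units against your SDK version's docs):

```python
from google import genai
from google.genai import types

# Raise the client timeout to cover the full 15-minute Flex queue window.
# 900_000 ms = 15 minutes; add headroom if your network is slow.
client = genai.Client(
    api_key="YOUR_API_KEY",
    http_options=types.HttpOptions(timeout=900_000),
)
```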
2. Priority silent downgrade
When you exceed Priority rate limits, requests silently fall through to Standard -- billed at Standard rates, no error thrown. Check the x-gemini-service-tier response header if you need to confirm you're actually getting Priority.
3. Priority rate limits are lower, not higher
Priority accounts get roughly 0.3x the standard RPM per model. It's designed for burst use cases, not sustained high throughput. If you need both guaranteed latency and high throughput, you'll need Tier 3 and a conversation with Google Cloud sales.
4. Priority needs Tier 2 ($100+ real spend, 3-day wait)
New accounts and Tier 1 accounts cannot use Priority inference at all. Tier 2 requires $100 in real billed charges (not free credits or promotional credits) plus at least 3 days from your first successful payment.
5. Context caching costs more on Priority
Cache pricing is tiered too. For Gemini 3.1 Pro: cached input costs $0.36/M on Priority vs $0.20/M on Standard -- the same 1.8x multiplier as regular tokens. If you rely heavily on caching, note that cached tokens don't shelter you from the premium: every token pays the uplift.