Model Release · April 5, 2026 · 8 min read

Gemma 4 is out: $0.14 per million tokens for a 31B model scoring 89% on AIME

Google DeepMind dropped four open-weight models on April 2. The headline number: Gemma 4 31B scores 89.2% on AIME 2026 and 84.3% on GPQA Diamond, available on OpenRouter for $0.14 per million input tokens. Gemma 3 27B scored 20.8% on AIME. That's not a typo.

*Close-up of a dark circuit board with teal lighting. Photo by Adi Goldstein on Unsplash.*

Four models, two architectures

Gemma 4 ships as four instruction-tuned models, all Apache 2.0 licensed. Two are dense transformers, two use a per-layer embedding trick that makes them smaller than their parameter count suggests. All four have native reasoning mode built in - you toggle it on, no separate model needed.

| Model | Architecture | Total params | Active params | Context | Modalities |
|---|---|---|---|---|---|
| Gemma 4 31B | Dense | 30.7B | 30.7B | 256K | Text, image, video |
| Gemma 4 26B A4B | MoE | 25.2B | 3.8B | 256K | Text, image, video |
| Gemma 4 E4B | Dense | 8B | 4.5B | 128K | Text, image, audio |
| Gemma 4 E2B | Dense | 5.1B | 2.3B | 128K | Text, image, audio |

"E" models use Per-Layer Embeddings for on-device efficiency. "A4B" = 3.8B active parameters per token in the MoE architecture. All models support native system prompts and function calling.

The split between the 31B and 26B MoE is the one worth paying attention to. The MoE activates only 3.8B parameters per forward pass but matches the 31B on most benchmarks. It runs at roughly the same speed as a 4B model while producing output quality that would have been frontier-class a year ago.
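The throughput claim is just arithmetic: decode compute scales with active parameters (roughly 2 FLOPs per active parameter per generated token), not total parameters. A back-of-envelope comparison using the counts from the table above:

```python
def flops_per_token(active_params_b: float) -> float:
    """Approximate forward-pass FLOPs per generated token: ~2 * active params."""
    return 2 * active_params_b * 1e9

dense_31b = flops_per_token(30.7)  # Gemma 4 31B: all parameters active
moe_26b = flops_per_token(3.8)     # Gemma 4 26B A4B: 3.8B active per token

speedup = dense_31b / moe_26b
print(f"MoE does ~{speedup:.1f}x less compute per token than the 31B dense model")
```

Roughly 8x less compute per token, which is why the MoE decodes at 4B-class speed despite carrying 25.2B parameters in memory.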

What it costs on OpenRouter

Gemma 4 is open-weight, so self-hosting costs depend on your hardware. For API access, OpenRouter has the 31B and 26B MoE live. The E4B and E2B aren't on providers yet. Google hasn't added Gemma 4 to its managed API (Vertex AI still lists Gemma 3 variants only).

| Model | Input / 1M | Output / 1M | Context | Provider |
|---|---|---|---|---|
| Gemma 4 31B | $0.14 | $0.40 | 256K | OpenRouter |
| Gemma 4 26B A4B | $0.13 | $0.40 | 256K | OpenRouter |
| Gemma 3 27B IT | $0.10 | $0.10 | 128K | OpenRouter |
| Llama 4 Scout | $0.08 | $0.30 | 512K | OpenRouter |
| DeepSeek V3.2 | $0.28 | $0.42 | 128K | DeepSeek API |
| GPT-5.4 Nano | $0.20 | $1.25 | 400K | OpenAI |

Prices as of April 5, 2026. OpenRouter pricing varies by underlying provider. Gemma 3 27B and Llama 4 Scout included for context.

The cost story is the MoE undercutting the dense model ($0.13 vs $0.14 input) despite similar benchmark scores. With only 3.8B active parameters, inference providers can fit more concurrent requests on the same GPU, and they pass those savings on. If you care about cost and can accept a ~1-point benchmark sacrifice on AIME, the 26B MoE is the better pick.

Compared to other budget models: Gemma 4 26B MoE at $0.13 input is five cents more than Llama 4 Scout ($0.08) but posts significantly higher reasoning scores. It costs less than half what DeepSeek V3.2 charges ($0.28). Against GPT-5.4 Nano ($0.20 input, $1.25 output) the input gap is smaller, but Nano's output cost is roughly triple.
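To put the per-token differences in workload terms, here's a quick cost model over the prices in the table above. The traffic numbers (50M input, 10M output tokens per month) are illustrative assumptions, not anyone's real workload:

```python
PRICES = {  # $ per 1M tokens (input, output), from the pricing table above
    "gemma-4-31b":     (0.14, 0.40),
    "gemma-4-26b-a4b": (0.13, 0.40),
    "llama-4-scout":   (0.08, 0.30),
    "deepseek-v3.2":   (0.28, 0.42),
    "gpt-5.4-nano":    (0.20, 1.25),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a month of traffic, expressed in millions of tokens."""
    inp, out = PRICES[model]
    return input_mtok * inp + output_mtok * out

# Hypothetical workload: 50M input + 10M output tokens per month.
for model in PRICES:
    print(f"{model:16s} ${monthly_cost(model, 50, 10):6.2f}/month")
```

At that volume the 26B MoE comes to about $10.50/month against $22.50 for GPT-5.4 Nano - the output-token price is what separates them, not the headline input rate.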

Benchmark scores: what happened between Gemma 3 and 4

We don't normally get generational jumps like this in open models. The numbers below are all from instruction-tuned variants with reasoning mode enabled.

| Benchmark | Gemma 4 31B | 26B MoE | Gemma 3 27B | Jump |
|---|---|---|---|---|
| AIME 2026 (math competition) | 89.2% | 88.3% | 20.8% | +68.4 pts |
| GPQA Diamond (graduate science) | 84.3% | 82.3% | 42.4% | +41.9 pts |
| MMLU Pro (broad knowledge) | 85.2% | 82.6% | 67.6% | +17.6 pts |
| LiveCodeBench v6 (code generation) | 80.0% | 77.1% | 29.1% | +50.9 pts |
| Codeforces ELO (competitive coding) | 2150 | 1718 | 110 | +2040 |
| MMMU Pro (vision reasoning) | 76.9% | 73.8% | 49.7% | +27.2 pts |

Scores from Google DeepMind's Gemma page and HuggingFace model cards. All instruction-tuned with reasoning mode. Gemma 3 scores from Google's published benchmarks.

The Codeforces ELO jump from 110 to 2150 is the one that made people stop scrolling. A 2150 rating puts Gemma 4 31B in the Master tier on Codeforces, up from below Newbie for Gemma 3. On the HuggingFace discussion page, one commenter put it well: the model didn't release, it escaped.

GPQA Diamond at 84.3% from a 31B open model is something to sit with. For reference, our reasoning models comparison showed DeepSeek R1 at 81.0% and o4-mini at 81.4% on the same benchmark. Gemma 4 beats both. The R1 costs $0.55/M input. o4-mini costs $1.10/M. Gemma 4 costs $0.14/M.

The MoE trailing by 1-3 points across the board is fine - arguably even ideal from a cost perspective. You lose almost nothing and gain inference speed. The exception is Codeforces ELO where the gap is wider (2150 vs 1718), which suggests the full dense model handles highly competitive coding problems better.

The MoE variant is the interesting one for production

Most of the social media attention went to the 31B's AIME score. But the 26B A4B is the model that will actually get deployed. It activates 3.8B parameters per token out of 25.2B total, which means it runs at roughly the throughput of a 4B-class model while producing results within 1-3 points of a 31B dense model.

At $0.13/M input on OpenRouter, it sits in a price bracket with models that score 40-60 points lower on most benchmarks. It also slightly undercuts Mistral Small 4 ($0.15/M input) while posting comparable benchmark scores. And because the active parameter count is so low, it has better latency characteristics than any dense model at similar quality levels.

For teams self-hosting, the MoE is even more attractive. 3.8B active parameters means it can run on a single consumer GPU with quantization, while the 31B needs more substantial hardware. The HuggingFace page already has 42+ community quantizations posted for the 31B, and the MoE versions are following fast.
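The single-GPU claim is worth checking with arithmetic. One caveat: for an MoE, all 25.2B weights must sit in memory even though only 3.8B are active per token - sparsity cuts compute, not footprint. Weight memory is roughly total parameters × bits per weight / 8 (KV cache and activations add overhead on top). A rough sizing sketch:

```python
def weight_memory_gb(total_params_b: float, bits: int) -> float:
    """Approximate weight-only memory in GB at a given quantization width."""
    return total_params_b * 1e9 * bits / 8 / 1e9

# MoE keeps all 25.2B weights resident; only per-token compute drops.
for name, params in [("Gemma 4 31B", 30.7), ("Gemma 4 26B A4B", 25.2)]:
    for bits in (16, 8, 4):
        print(f"{name}: {bits}-bit ~ {weight_memory_gb(params, bits):.1f} GB")
```

At 4-bit the MoE lands around 12.6 GB of weights - inside a 16 GB consumer GPU with headroom for KV cache - while the 31B's ~15.4 GB leaves almost none, which is why it wants more substantial hardware.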

What's not there yet

No Google API pricing. Vertex AI and AI Studio both still list Gemma 3 variants. Given that the weights dropped on April 2 and OpenRouter had them up within a day, Google's managed API will probably follow soon, but right now you can't run Gemma 4 through Google's own infrastructure with official pricing.

The E4B and E2B (the smaller audio-capable models) aren't on any hosted provider yet. These are the ones with native audio input - up to 30 seconds of speech for ASR and translation. If that matters for your use case, you'll need to self-host for now.

Training data goes through January 2025. For a model released April 2, 2026, that's a 14-month knowledge gap. It won't know about models or pricing changes from 2025 onward without retrieval augmentation.

So what?

An open-weight model at $0.13-0.14/M input just posted benchmark scores that match or beat models costing 4-8x more. The 26B MoE variant in particular makes it hard to justify a lot of the mid-tier paid API pricing - unless you need features Gemma 4 doesn't have (like longer context, native computer use, or managed SLAs).

We've added both models to our pricing table. If you're currently paying for reasoning capability from a closed provider, run the numbers on whether Gemma 4 26B MoE gets you close enough for your workload. At $0.13 per million input tokens, a penny buys roughly 77,000 tokens, so the switching cost for testing is basically zero.
