How much does DiffusionGemma 26B cost to use?

DiffusionGemma 26B has no per-token API price. Google released it under Apache 2.0 as open weights on June 10, 2026, so the only way to run it is to self-host or use a free tier in Vertex AI Model Garden. On a rented H100 sustaining around 1,000 tokens per second, the all-in compute cost lands roughly between $0.19 and $0.91 per million output tokens depending on the GPU provider, before you account for idle time. That estimate assumes near-continuous utilization; a box that sits idle half the day doubles the effective rate.

Is DiffusionGemma faster than regular Gemma 4?

Yes. Google states DiffusionGemma generates up to 4x faster than the autoregressive Gemma 4 26B A4B it is built on. It reaches over 1,000 tokens per second on a single H100 and 700-plus on an RTX 5090. The speedup comes from text diffusion: instead of producing one token at a time, it denoises a 256-token canvas in parallel across about 12 to 16 steps. The catch is that the speedup is scoped to local and low-concurrency inference and shrinks under high-QPS cloud serving.

Does DiffusionGemma score lower than standard Gemma 4?

It does, and Google says so directly. The official guidance is that DiffusionGemma's output quality is lower than standard Gemma 4 and that quality-critical applications should deploy standard Gemma 4 instead. On Google's own table the diffusion model trails the autoregressive Gemma 4 26B A4B on every benchmark published: MMLU Pro 77.6 vs 82.6, GPQA Diamond 73.2 vs 82.3, LiveCodeBench v6 69.1 vs 77.1, and MATH-Vision 70.5 vs 82.4.

What hardware does DiffusionGemma 26B need to run?

DiffusionGemma is a 25.2B-parameter mixture-of-experts model with 3.8B active parameters. In its native NVFP4 four-bit quantization it fits within 18GB of VRAM, so it runs on a single high-end consumer GPU like an RTX 5090 or any datacenter card. It supports vLLM (the first diffusion model vLLM serves natively), Hugging Face Transformers, SGLang, MLX, and Unsloth. Ollama support is not available yet because llama.cpp does not implement the diffusion sampler.

Should I use DiffusionGemma or a hosted diffusion model like Mercury 2?

Mercury 2 is a hosted diffusion LLM at $0.25 input and $0.75 output per million tokens with no infrastructure to manage. DiffusionGemma is open weights you run yourself. If your traffic is steady enough to keep a GPU busy and you need data to stay on your own hardware, self-hosting DiffusionGemma can undercut Mercury 2 on compute cost. If traffic is bursty or you would rather not operate inference infrastructure, Mercury 2's fixed per-token price is the simpler economics and removes the idle-GPU problem entirely.

Model Release · June 11, 2026 · 7 min read

DiffusionGemma generates 4x faster than its own sibling and scores worse on every benchmark Google published.

Google shipped DiffusionGemma 26B on June 10, an experimental open-weight model that writes text by denoising a canvas in parallel instead of one token at a time. It hits over 1,000 tokens per second on a single H100. It also has no price tag, because Google isn't selling it by the token. The awkward part: the autoregressive Gemma 4 it is built on is both smarter and already rentable for thirty cents a million. So we worked out the one place the speed actually earns its keep.


Three things to hold in your head before the tables start:

It is a 25.2B-parameter MoE (3.8B active), Apache 2.0, with no per-token rate. You run it yourself or you do not run it.
The selling point is raw throughput: 1,000-plus tokens a second on a single H100, up to 4x the autoregressive Gemma 4 it shares a backbone with.
Google says, in writing, that it scores lower than that same Gemma 4 on every benchmark. So you are buying speed with accuracy, and that only pays back if you own the GPU and keep it busy.

A Gemma that paints the answer instead of typing it

Nearly every model you have priced on this site so far is autoregressive. It predicts the next token, appends it, and predicts again, left to right, one step per token. DiffusionGemma works differently. It starts with a 256-token canvas of placeholder tokens and runs roughly 12 to 16 denoising passes over the whole block at once, locking in high-confidence tokens and using them to refine the rest. For longer outputs it commits a finished block to the cache and opens a fresh canvas. The practical effect is that the number of forward passes scales with the number of 256-token blocks rather than the number of tokens, which is where the speed comes from.
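To put rough numbers on that, here is a minimal sketch of the pass count for each approach. It assumes the 256-token canvas and the upper end of the 12-to-16 step range described above; the function names are ours, not Google's. It counts forward passes only and says nothing about wall-clock time, since each diffusion pass works on a whole block and costs more than a single-token step.

```python
import math


def autoregressive_passes(output_tokens: int) -> int:
    """One forward pass per generated token."""
    return output_tokens


def diffusion_passes(output_tokens: int, canvas: int = 256, steps: int = 16) -> int:
    """Each canvas of up to `canvas` tokens costs `steps` denoising passes."""
    blocks = math.ceil(output_tokens / canvas)
    return blocks * steps


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        print(f"{n} tokens: autoregressive={autoregressive_passes(n)}, "
              f"diffusion={diffusion_passes(n)}")
```

Under these assumptions a 1,024-token answer takes 1,024 autoregressive passes against 64 diffusion passes. The pass count still grows with length, but one block at a time instead of one token at a time.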

The base is the same Gemma 4 26B A4B mixture-of-experts model Google shipped in April, with a diffusion head bolted on. It carries a 256K-token context window, takes text, image, and video input, and writes text out. It does not handle audio. Google labels it experimental and ships it in 18 quantized variants, with the native NVFP4 four-bit format squeezing the whole thing into 18GB of VRAM. That is the genuinely notable part for cost: an RTX 5090 has enough memory to run a model this capable at this speed on a desk.

Spec	DiffusionGemma 26B
Parameters	25.2B total, 3.8B active (MoE)
License	Apache 2.0 (open weights)
Context window	256K tokens
Modality	Text, image, video in; text out (no audio)
VRAM (NVFP4)	Fits within 18GB
Per-token API price	None - self-host only
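The 18GB figure is easy to sanity-check from the parameter count. The sketch below is back-of-envelope arithmetic, not Google's accounting: it assumes every one of the 25.2B weights is stored at four bits, and the 4.5-bit case assumes one 8-bit scale shared by each group of 16 weights, which is our reading of the NVFP4 format and not something the model card states. GB here means 10^9 bytes.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a given bit width."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    total_params_b = 25.2  # total parameters, from the spec table
    for bits in (4.0, 4.5):  # 4.5 is an assumed figure including scale overhead
        print(f"{bits} bits/weight -> {weight_gb(total_params_b, bits):.1f} GB")
```

Either way the weights come in well under 18GB, which leaves a few gigabytes for the KV cache and activations. Note that all 25.2B parameters have to sit in memory even though only 3.8B are active per token.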

The speed is real, and it depends entirely on the box

Google's throughput numbers are tied to specific hardware, which matters because self-hosting cost is throughput divided into the hourly GPU rate. The faster the chip, the cheaper each token. Here is what Google and NVIDIA report across the lineup.

Hardware	Throughput	Tier
NVIDIA DGX Station	up to 2,000 tok/s	Workstation
H100 (single)	1,000+ tok/s	Datacenter
RTX 5090	700+ tok/s	Consumer
NVIDIA DGX Spark	up to 150 tok/s	Desktop AI

The 4x figure Google quotes is measured against the autoregressive Gemma 4 26B A4B running normally, not against a competitor. Read the fine print and one caveat keeps recurring: the speedup is strongest at low to medium batch sizes on a single accelerator and gives diminishing returns under high-concurrency cloud serving. In other words, the diffusion advantage is a single-user, low-batch phenomenon. It makes an interactive local model feel instant. It does not automatically make a busy multi-tenant API cheaper.

What you give up: every benchmark on the card

Most model launches bury the regressions. Google did not. The DiffusionGemma announcement states plainly that its output quality is lower than standard Gemma 4, and recommends deploying standard Gemma 4 for any application that demands maximum quality. The model card backs that up with a head-to-head against the exact autoregressive sibling it was distilled from.

Benchmark	DiffusionGemma	Gemma 4 26B A4B	Gap
MMLU Pro	77.6	82.6	-5.0
GPQA Diamond	73.2	82.3	-9.1
LiveCodeBench v6	69.1	77.1	-8.0
MATH-Vision	70.5	82.4	-11.9

The gap widens as the task gets harder. Five points on MMLU Pro is the kind of difference you might shrug off for a 4x speedup. Nine points on graduate-level science and twelve on visual math are not. There is one bright spot Google highlights: DiffusionGemma is unusually responsive to fine-tuning. A base model that scored near zero on Sudoku reached 80 percent after task-specific tuning, solving puzzles in 12 denoising steps. The takeaway is that the off-the-shelf scores understate what the architecture can do on a narrow, well-defined task you train it for.

No price tag means you become the pricing department

When a model has no per-token rate, the cost does not disappear. It moves onto your cloud bill as GPU-hours, and you convert it back to dollars per million tokens yourself. The formula is simple: take the hourly rate of the GPU, divide by the tokens it produces per hour. At 1,000 tokens per second, an H100 turns out 3.6 million tokens in an hour of solid generation. Here is where that lands across common H100 rentals.

H100 source	Rate / hr	Cost / 1M output*
Vast.ai (marketplace low)	$0.67	~$0.19
RunPod H100 PCIe	$1.99	~$0.55
RunPod H100 SXM	$2.69	~$0.75
Lambda H100	$3.29	~$0.91

*These assume the GPU runs at full throughput continuously, generating a single stream. That is the best case and it almost never holds. The moment your box sits idle between requests, the effective cost per token climbs, because you pay for the hour whether or not it produces tokens. A node busy half the day doubles every figure in that last column. Input-token processing, batching gains, and setup overhead all push in different directions. Treat these as the floor, not the bill.
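The conversion is small enough to keep in a script. This is a minimal sketch of the formula above with a utilization factor added; the function name and the `utilization` parameter are ours, and it models output tokens on a single stream only, ignoring input processing, batching, and setup time.

```python
def cost_per_million(hourly_rate_usd: float,
                     tokens_per_second: float = 1000.0,
                     utilization: float = 1.0) -> float:
    """Dollars per million output tokens for a rented GPU.

    utilization is the fraction of each paid hour spent generating (0 < u <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Cheapest and priciest hourly rates from the table above.
    for label, rate in (("marketplace low", 0.67), ("Lambda H100", 3.29)):
        full = cost_per_million(rate)
        half = cost_per_million(rate, utilization=0.5)
        print(f"{label}: ${full:.2f} at 100% busy, ${half:.2f} at 50% busy")
```

Swap in your own hourly rate, measured throughput, and realistic utilization. The last of those moves the result more than the choice of provider does.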

We have walked this self-host-versus-rent question before with Gemma 4 12B, and the conclusion rhymes: open weights are only free if your time and your idle GPUs are free.

The comparison Google would rather you not run

Put DiffusionGemma next to two models it competes with directly, and the picture gets uncomfortable. One is its own autoregressive sibling, which you can rent hosted on OpenRouter. The other is Mercury 2, the hosted diffusion LLM we covered in May. Both of those have a fixed price, no GPU to babysit, and in one case a higher benchmark score.

Model	Access	Output / 1M	Speed	Context
DiffusionGemma 26B	Self-host	~$0.19-0.91*	1,000+ tok/s	256K
Gemma 4 26B A4B	Hosted (OpenRouter)	$0.30	autoregressive	256K
Mercury 2	Hosted API	$0.75	~1,009 tok/s	128K

Look at the middle row. Gemma 4 26B A4B, the autoregressive model DiffusionGemma is distilled from, lists at $0.30 per million output tokens hosted, scores higher on every benchmark, and costs you nothing to operate. At the best-case self-host figures, DiffusionGemma only undercuts it on the marketplace-spot end, and even then you are absorbing all the idle-GPU risk to save maybe a dime per million while accepting a quality drop. The only thing you cannot buy from the hosted sibling is speed: 1,000 tokens per second on a box you control.

Mercury 2 is the cleaner comparison for the speed-first crowd. It matches DiffusionGemma's throughput (Artificial Analysis clocks it around 1,000 tok/s), it is a managed endpoint with no infrastructure, and at $0.75 per million output it prices in line with a mid-tier H100 rental running the open model. If your traffic is bursty, Mercury 2 wins on simplicity alone, because you never pay for an idle GPU. DiffusionGemma's edge appears only when traffic is steady, data has to stay on your own hardware, or you have fine-tuned it for a specific task where the base benchmark gap stops mattering.
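"Steady enough" can be turned into a number: the share of each paid hour the GPU must spend generating before self-hosting matches a hosted output price. The sketch below is a rough guide under stated assumptions. It compares output-token cost only, assumes 1,000 tokens per second, and ignores hosted input pricing and your own operations time; the function name is ours.

```python
def break_even_utilization(hourly_rate_usd: float,
                           hosted_price_per_million: float,
                           tokens_per_second: float = 1000.0) -> float:
    """Fraction of each paid hour the GPU must be generating to match a hosted price.

    A result above 1.0 means self-hosting cannot match that price at this rate.
    """
    full_util_cost = hourly_rate_usd / (tokens_per_second * 3600) * 1_000_000
    return full_util_cost / hosted_price_per_million


if __name__ == "__main__":
    cases = (
        ("$0.67/hr vs hosted Gemma 4 at $0.30", 0.67, 0.30),
        ("$1.99/hr vs Mercury 2 at $0.75", 1.99, 0.75),
        ("$3.29/hr vs Mercury 2 at $0.75", 3.29, 0.75),
    )
    for label, rate, hosted in cases:
        u = break_even_utilization(rate, hosted)
        verdict = f"{u:.0%} busy" if u <= 1 else "never breaks even"
        print(f"{label}: {verdict}")
```

With these inputs the $1.99 box has to be generating roughly three quarters of the time just to tie Mercury 2 on output cost, and the $3.29 box cannot tie it at all.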

Who this is actually for

Strip away the launch noise and DiffusionGemma fits a narrow, real slot. Run it if you operate your own GPUs at high utilization and latency is a product feature, not a nice-to-have: live autocomplete, streaming agents, on-device assistants, anything where a user is watching tokens appear. Run it if your data cannot leave your hardware and you were going to self-host regardless, since you may as well self-host the fast one. And run it if you have a bounded task you can fine-tune for, where the Sudoku-style jump from near-zero to 80 percent suggests the architecture punches above its base scores.

Skip it for everything else. If you want Gemma 4 quality, rent the autoregressive version for thirty cents and move on. If you want hosted diffusion speed without running infrastructure, Mercury 2 is sitting right there. DiffusionGemma is not a cheaper way to get the same answer; it is a faster way to get a slightly worse one, and that trade only closes when you own the hardware and keep it busy. Google built an interesting research artifact and was honest about its limits. The cost math just asks you to be honest about your utilization before you reach for it. You can line up the hosted alternatives on our pricing page and run your own volume through the cost calculator before committing a GPU to it.

Sources

Google: DiffusionGemma for faster text generation - 4x speed claim, 1,000+ tok/s on H100, 18GB VRAM, explicit quality caveat
Hugging Face: google/diffusiongemma-26B-A4B-it model card - 25.2B/3.8B params, benchmark table vs Gemma 4, NVFP4 quantization
NVIDIA: Run DiffusionGemma for high-throughput generation - DGX Spark and DGX Station throughput, 256K context, NIM packaging
Google Developers: DiffusionGemma developer guide - denoising-step mechanics, Sudoku fine-tuning result, vLLM support
MarkTechPost: DiffusionGemma release coverage - June 10 release date, parallel-canvas explanation
OpenRouter: Mercury 2 - hosted diffusion comparison at $0.25/$0.75 per 1M, 128K context
