DiffusionGemma generates 4x faster than its own sibling and scores worse on every benchmark Google published.
Google shipped DiffusionGemma 26B on June 10, an experimental open-weight model that writes text by denoising a canvas in parallel instead of one token at a time. It hits over 1,000 tokens per second on a single H100. It also has no price tag, because Google isn't selling it by the token. The awkward part: the autoregressive Gemma 4 it is built on is both smarter and already rentable for thirty cents per million output tokens. So we worked out the one place the speed actually earns its keep.

Three things to hold in your head before the tables start:
- It is a 25.2B-parameter MoE (3.8B active), Apache 2.0, with no per-token rate. You run it yourself or you do not run it.
- The selling point is raw throughput: 1,000-plus tokens a second on a single H100, up to 4x the autoregressive Gemma 4 it shares a backbone with.
- Google says, in writing, that it scores lower than that same Gemma 4 on every benchmark. So you are buying speed with accuracy, and that only pays back if you own the GPU and keep it busy.
A Gemma that paints the answer instead of typing it
Every model you have priced on this site so far is autoregressive. It predicts the next token, appends it, and predicts again, left to right, one step per token. DiffusionGemma works differently. It starts with a 256-token canvas of placeholder tokens and runs roughly 12 to 16 denoising passes over the whole block at once, locking in high-confidence tokens and using them to refine the rest. For longer outputs it commits a finished block to the cache and opens a fresh canvas. The practical effect is that each 256-token block costs roughly 12 to 16 forward passes instead of 256, which is where the speed comes from.
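To make the pass-count difference concrete, here is a minimal sketch that counts forward passes under both schemes. It assumes the 256-token canvas and the upper figure of 16 denoising passes per block quoted above; the function names are ours, not part of any Google API.

```python
import math


def autoregressive_passes(output_tokens: int) -> int:
    """Autoregressive decoding: one forward pass per generated token."""
    return output_tokens


def block_diffusion_passes(
    output_tokens: int,
    block_size: int = 256,      # canvas size quoted in the article
    steps_per_block: int = 16,  # upper end of the 12-16 range (assumption)
) -> int:
    """Block diffusion: a fixed number of denoising passes per canvas,
    however many tokens of that canvas end up being used."""
    blocks = math.ceil(output_tokens / block_size)
    return blocks * steps_per_block


for n in (256, 1024, 4096):
    ar = autoregressive_passes(n)
    diff = block_diffusion_passes(n)
    print(f"{n:>5} tokens: {ar:>5} autoregressive passes vs {diff:>4} diffusion passes")
```

Pass counts are not wall-clock time. A denoising pass works on a whole canvas rather than a single new token, so the ratio of passes is an upper bound on the speedup, not a prediction of it.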
The base is the same Gemma 4 26B A4B mixture-of-experts model Google shipped in April, with a diffusion head bolted on. It carries a 256K-token context window, takes text, image, and video input, and writes text out. It does not handle audio. Google labels it experimental and ships it in 18 quantized variants, with the native NVFP4 four-bit format squeezing the whole thing into 18GB of VRAM. That is the genuinely notable part for cost: an RTX 5090 has enough memory to run a model this capable at this speed on a desk.
| Spec | DiffusionGemma 26B |
|---|---|
| Parameters | 25.2B total, 3.8B active (MoE) |
| License | Apache 2.0 (open weights) |
| Context window | 256K tokens |
| Modality | Text, image, video in; text out (no audio) |
| VRAM (NVFP4) | Fits within 18GB |
| Per-token API price | None - self-host only |
The speed is real, and it depends entirely on the box
Google's throughput numbers are tied to specific hardware, which matters because self-hosting cost per token is the hourly GPU rate divided by throughput. At a given hourly rate, the faster the chip, the cheaper each token. Here is what Google and NVIDIA report across the lineup.
| Hardware | Throughput | Tier |
|---|---|---|
| NVIDIA DGX Station | up to 2,000 tok/s | Workstation |
| H100 (single) | 1,000+ tok/s | Datacenter |
| RTX 5090 | 700+ tok/s | Consumer |
| NVIDIA DGX Spark | up to 150 tok/s | Desktop AI |
The 4x figure Google quotes is measured against the autoregressive Gemma 4 26B A4B running normally, not against a competitor. Read the fine print and one caveat keeps recurring: the speedup is strongest at low to medium batch sizes on a single accelerator and gives diminishing returns under high-concurrency cloud serving. In other words, the diffusion advantage is a single-user, low-batch phenomenon. It makes an interactive local model feel instant. It does not automatically make a busy multi-tenant API cheaper.
What you give up: every benchmark on the card
Most model launches bury the regressions. Google did not. The DiffusionGemma announcement states plainly that its output quality is lower than standard Gemma 4, and recommends deploying standard Gemma 4 for any application that demands maximum quality. The model card backs that up with a head-to-head against the exact autoregressive sibling it was distilled from.
| Benchmark | DiffusionGemma | Gemma 4 26B A4B | Gap |
|---|---|---|---|
| MMLU Pro | 77.6 | 82.6 | -5.0 |
| GPQA Diamond | 73.2 | 82.3 | -9.1 |
| LiveCodeBench v6 | 69.1 | 77.1 | -8.0 |
| MATH-Vision | 70.5 | 82.4 | -11.9 |
The gap widens as the task gets harder. Five points on MMLU Pro is the kind of difference you might shrug off for a 4x speedup. Nine points on graduate-level science and twelve on visual math are not. There is one bright spot Google highlights: DiffusionGemma is unusually responsive to fine-tuning. A base model that scored near zero on Sudoku reached 80 percent after task-specific tuning, solving puzzles in 12 denoising steps. The takeaway is that the off-the-shelf scores understate what the architecture can do on a narrow, well-defined task you train it for.
No price tag means you become the pricing department
When a model has no per-token rate, the cost does not disappear. It moves onto your cloud bill as GPU-hours, and you convert it back to dollars per million tokens yourself. The formula is simple: take the hourly rate of the GPU, divide by the tokens it produces per hour, and multiply by a million. At 1,000 tokens per second, an H100 turns out 3.6 million tokens in an hour of solid generation. Here is where that lands across common H100 rentals.
| H100 source | Rate / hr | Cost / 1M output* |
|---|---|---|
| Vast.ai (marketplace low) | $0.67 | ~$0.19 |
| RunPod H100 PCIe | $1.99 | ~$0.55 |
| RunPod H100 SXM | $2.69 | ~$0.75 |
| Lambda H100 | $3.29 | ~$0.91 |
*These assume the GPU runs at full throughput continuously, generating a single stream. That is the best case and it almost never holds. The moment your box sits idle between requests, the effective cost per token climbs, because you pay for the hour whether or not it produces tokens. A node busy half the day doubles every figure in that last column. Input-token processing, batching gains, and setup overhead all push in different directions. Treat these as the floor, not the bill.
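The conversion from an hourly rate to dollars per million tokens, including the idle-time penalty, can be sketched in a few lines. This is a rough calculator under the same assumptions as the table: 1,000 tokens per second, output tokens only, and the hourly rates listed above. The `utilization` parameter and the function name are ours.

```python
def cost_per_million_tokens(
    hourly_rate_usd: float,
    tokens_per_second: float,
    utilization: float = 1.0,  # fraction of each paid hour spent generating
) -> float:
    """Effective dollars per million output tokens on a rented GPU."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# Hourly rates copied from the table above.
H100_RATES = {
    "Vast.ai (marketplace low)": 0.67,
    "RunPod H100 PCIe": 1.99,
    "RunPod H100 SXM": 2.69,
    "Lambda H100": 3.29,
}

for name, rate in H100_RATES.items():
    floor = cost_per_million_tokens(rate, 1000)
    half_busy = cost_per_million_tokens(rate, 1000, utilization=0.5)
    print(f"{name:<28} floor ${floor:.2f}  at 50% busy ${half_busy:.2f}")
```

Swap in your own measured throughput and your real utilization before trusting any of the output. Both numbers tend to be worse in production than on a launch slide.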
We have walked this self-host-versus-rent question before with Gemma 4 12B, and the conclusion rhymes: open weights are only free if your time and your idle GPUs are free.
The comparison Google would rather you not run
Put DiffusionGemma next to two models it competes with directly, and the picture gets uncomfortable. One is its own autoregressive sibling, which you can rent hosted on OpenRouter. The other is Mercury 2, the hosted diffusion LLM we covered in May. Both of those have a fixed price, no GPU to babysit, and in one case a higher benchmark score.
| Model | Access | Output / 1M | Speed | Context |
|---|---|---|---|---|
| DiffusionGemma 26B | Self-host | ~$0.19-0.91* | 1,000+ tok/s | 256K |
| Gemma 4 26B A4B | Hosted (OpenRouter) | $0.30 | autoregressive | 256K |
| Mercury 2 | Hosted API | $0.75 | ~1,009 tok/s | 128K |
Look at the middle row. Gemma 4 26B A4B, the autoregressive model DiffusionGemma is distilled from, lists at $0.30 per million output tokens hosted, scores higher on every benchmark, and leaves you no infrastructure to operate. At the best-case self-host figures, DiffusionGemma only undercuts it on the marketplace-spot end, and even then you are absorbing all the idle-GPU risk to save maybe a dime per million while accepting a quality drop. The only thing you cannot buy from the hosted sibling is speed: 1,000 tokens per second on a box you control.
Mercury 2 is the cleaner comparison for the speed-first crowd. It matches DiffusionGemma's throughput (Artificial Analysis clocks it around 1,000 tok/s), it is a managed endpoint with no infrastructure, and at $0.75 per million output it prices in line with a mid-tier H100 rental running the open model. If your traffic is bursty, Mercury 2 wins on simplicity alone, because you never pay for an idle GPU. DiffusionGemma's edge appears only when traffic is steady, data has to stay on your own hardware, or you have fine-tuned it for a specific task where the base benchmark gap stops mattering.
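One way to put a number on "steady traffic" is to ask how busy a rented GPU has to be before self-hosting matches a hosted price. The sketch below assumes 1,000 tokens per second and compares output-token price only; it ignores input tokens, the quality gap, and your own engineering time. The function name is ours.

```python
def break_even_utilization(
    hourly_rate_usd: float,
    tokens_per_second: float,
    hosted_price_per_million: float,
) -> float:
    """Fraction of each paid hour the GPU must spend generating for
    self-hosting to match the hosted price. Above 1.0 means it never does."""
    floor_cost = hourly_rate_usd / (tokens_per_second * 3600) * 1_000_000
    return floor_cost / hosted_price_per_million


# Hosted output prices and H100 hourly rates quoted in the article.
HOSTED = {"Gemma 4 26B A4B": 0.30, "Mercury 2": 0.75}
RENTALS = {"Vast.ai": 0.67, "RunPod H100 PCIe": 1.99, "Lambda H100": 3.29}

for gpu, rate in RENTALS.items():
    for model, price in HOSTED.items():
        u = break_even_utilization(rate, 1000, price)
        verdict = f"{u:.0%} busy" if u <= 1 else "never breaks even"
        print(f"{gpu:<18} vs {model:<16} {verdict}")
```

Under these assumptions, only the cheapest marketplace rate can match the hosted sibling on price, and only if the GPU is generating for well over half of every paid hour.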
Who this is actually for
Strip away the launch noise and DiffusionGemma fits a narrow, real slot. Run it if you operate your own GPUs at high utilization and latency is a product feature, not a nice-to-have: live autocomplete, streaming agents, on-device assistants, anything where a user is watching tokens appear. Run it if your data cannot leave your hardware and you were going to self-host regardless, since you may as well self-host the fast one. And run it if you have a bounded task you can fine-tune for, where the Sudoku-style jump from near-zero to 80 percent suggests the architecture punches above its base scores.
Skip it for everything else. If you want Gemma 4 quality, rent the autoregressive version for thirty cents and move on. If you want hosted diffusion speed without running infrastructure, Mercury 2 is sitting right there. DiffusionGemma is not a cheaper way to get the same answer; it is a faster way to get a slightly worse one, and that trade only closes when you own the hardware and keep it busy. Google built an interesting research artifact and was honest about its limits. The cost math just asks you to be honest about your utilization before you reach for it. You can line up the hosted alternatives on our pricing page and run your own volume through the cost calculator before committing a GPU to it.
Sources
- Google: DiffusionGemma for faster text generation - 4x speed claim, 1,000+ tok/s on H100, 18GB VRAM, explicit quality caveat
- Hugging Face: google/diffusiongemma-26B-A4B-it model card - 25.2B/3.8B params, benchmark table vs Gemma 4, NVFP4 quantization
- NVIDIA: Run DiffusionGemma for high-throughput generation - DGX Spark and DGX Station throughput, 256K context, NIM packaging
- Google Developers: DiffusionGemma developer guide - denoising-step mechanics, Sudoku fine-tuning result, vLLM support
- MarkTechPost: DiffusionGemma release coverage - June 10 release date, parallel-canvas explanation
- OpenRouter: Mercury 2 - hosted diffusion comparison at $0.25/$0.75 per 1M, 128K context